# BasqueBench

### Paper

BasqueBench is a benchmark for evaluating language models on Basque tasks; that is, it evaluates a model's ability to understand and generate Basque text. BasqueBench combines pre-existing open datasets with datasets developed exclusively for this benchmark. All the details of BasqueBench will be published in a paper soon.
The new evaluation datasets included in BasqueBench are:
| Task | Category | Homepage |
|---|---|---|
| MGSM_eu | Math | https://huggingface.co/datasets/HiTZ/MGSM-eu |
| PIQA_eu | Question Answering | https://huggingface.co/datasets/HiTZ/PIQA-eu |
| WNLI_eu | Natural Language Inference | https://huggingface.co/datasets/HiTZ/wnli-eu |
| XCOPA_eu | Commonsense Reasoning | https://huggingface.co/datasets/HiTZ/XCOPA-eu |
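These datasets are hosted on the Hugging Face Hub and can be inspected directly with the `datasets` library. Below is a minimal sketch; the split and column names are not documented in this README, and if a dataset defines multiple configurations a config name may also be needed:

```python
# A minimal sketch of inspecting one of the new datasets from the Hugging Face
# Hub. Split and column names are not asserted here, so we only print the
# dataset structure rather than assuming them.
from datasets import load_dataset

ds = load_dataset("HiTZ/MGSM-eu")  # any homepage from the table above works
print(ds)                          # shows the available splits and their features
```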
The remaining datasets included in BasqueBench have been made public in previous publications; they appear in the task list below.
### Citation
Paper for BasqueBench coming soon.
### Groups and Tasks

#### Groups

- `basque_bench`: All tasks included in BasqueBench.
- `flores_eu`: All FLORES translation tasks from or to Basque.
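Assuming the tasks are installed alongside LM Evaluation Harness, the whole group can be run through the harness's Python API. A minimal sketch; the model identifier below is a placeholder, not part of BasqueBench itself:

```python
# A minimal sketch using lm-evaluation-harness's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face transformers backend
    model_args="pretrained=your-org/your-model",  # placeholder model identifier
    tasks=["basque_bench"],                       # the full BasqueBench group
    batch_size=8,
)
print(results["results"])                         # per-task metric dictionary
```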
#### Tasks

The following tasks evaluate models on the BasqueBench datasets using various scoring methods:

- `belebele_eus_Latn`
- `eus_exams_eu`
- `eus_proficiency`
- `eus_reading`
- `eus_trivia`
- `flores_eu`
- `flores_eu-ca`
- `flores_eu-de`
- `flores_eu-en`
- `flores_eu-es`
- `flores_eu-fr`
- `flores_eu-gl`
- `flores_eu-it`
- `flores_eu-pt`
- `flores_ca-eu`
- `flores_de-eu`
- `flores_en-eu`
- `flores_es-eu`
- `flores_fr-eu`
- `flores_gl-eu`
- `flores_it-eu`
- `flores_pt-eu`
- `mgsm_direct_eu`
- `mgsm_native_cot_eu`
- `piqa_eu`
- `qnlieu`
- `wnli_eu`
- `xcopa_eu`
- `xnli_eu`
- `xnli_eu_native`
- `xstorycloze_eu`
Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
- `belebele_eus_Latn`: Belebele Basque
- `qnlieu`: From BasqueGLUE
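To check which of these tasks are registered in a given installation of the harness, something like the following can be used. This is a sketch that assumes a recent lm-eval release exposing `TaskManager` and its `all_tasks` listing:

```python
# A sketch for listing the BasqueBench-related tasks registered in the local
# lm-eval installation; assumes a recent release that provides TaskManager.
from lm_eval.tasks import TaskManager

tm = TaskManager()
# Coarse filter: keep task names that mention Basque ("eu" also matches "eus").
basque_tasks = sorted(t for t in tm.all_tasks if "eu" in t)
print(basque_tasks)
```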
### Checklist

- Is the task an existing benchmark in the literature?
  - Have you referenced the original paper that introduced the task?
  - If yes, does the original paper provide a reference implementation?
    - Yes, original implementation contributed by author of the benchmark

If other tasks on this dataset are already supported:

- Is the "Main" variant of this task clearly denoted?
- Have you provided a short sentence in a README on what each new variant adds / evaluates?
- Have you noted which, if any, published evaluation setups are matched by this variant?