model-index:
- name: BEDAI-2B
  results:
  - task:
      type: multiple-choice
      name: Exams (TR)
    dataset:
      name: exams_tr
      type: exams_tr
      args:
        split: validation
    metrics:
    - name: accuracy_norm
      type: accuracy
      value: 25.70
  - task:
      type: question-answering-extractive
      name: TQuAD (TR)
    dataset:
      name: tquad
      type: tquad
      args:
        split: validation
    metrics:
    - name: exact_match
      type: exact_match
      value: 9.9807
    - name: f1
      type: f1
      value: 22.9314
  - task:
      type: question-answering-extractive
      name: XQuAD (TR)
    dataset:
      name: xquad_tr
      type: xquad_tr
      args:
        split: validation
    metrics:
    - name: exact_match
      type: exact_match
      value: 6.4706
    - name: f1
      type: f1
      value: 13.0114
  - task:
      type: text-classification
      name: Turkish PLU (overall)
    dataset:
      name: turkish_plu
      type: turkish_plu
      args:
        split: test
    metrics:
    - name: accuracy_norm
      type: accuracy
      value: 51.58
## Evaluation (CETVEL – Turkish subsets)
Raw artifacts: `nurcunal/BEDAI-2B-cetvel-2025-10-31`

This quick sweep covers three CETVEL Turkish subsets: multiple-choice QA (MCQA: `exams_tr`, accuracy_norm), extractive QA (QA: mean F1 over `tquad` and `xquad_tr`), and text classification (TC: `turkish_plu`, accuracy_norm).

BEDAI-2B (this run): MCQA 25.70, QA 17.97, TC 51.58.
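The summary line above is simple arithmetic over the per-task metrics in the model-index block: MCQA and TC are reported directly, and QA is the unweighted mean of the two extractive-QA F1 scores. A minimal sketch of that computation (the `scores` dictionary merely restates the values reported above):

```python
# Per-task scores copied from the model-index metadata above.
scores = {
    "exams_tr": {"accuracy_norm": 25.70},
    "tquad": {"f1": 22.9314},
    "xquad_tr": {"f1": 13.0114},
    "turkish_plu": {"accuracy_norm": 51.58},
}

mcqa = scores["exams_tr"]["accuracy_norm"]                    # reported directly
qa = (scores["tquad"]["f1"] + scores["xquad_tr"]["f1"]) / 2   # unweighted mean F1
tc = scores["turkish_plu"]["accuracy_norm"]                   # reported directly

print(f"MCQA {mcqa:.2f} | QA {qa:.2f} | TC {tc:.2f}")
# MCQA 25.70 | QA 17.97 | TC 51.58
```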
| Model | MCQA (acc_norm) | QA (mean F1) | TC (acc_norm) |
|---|---|---|---|
| BEDAI-2B (this work) | 25.70 | 17.97 | 51.58 |
| CohereLabs/aya-expanse-32b | 52.47 | 20.48 | 50.67 |
| CohereLabs/aya-expanse-8b | 44.09 | 0.19 | 50.03 |
| google/gemma-2-9b-it | 48.20 | 4.46 | 45.38 |
| google/gemma-3-12b-it | 52.66 | 10.26 | 54.38 |
| google/gemma-3-27b-it | 55.40 | 10.56 | 53.65 |
| google/gemma-3-4b-it | 42.33 | 8.22 | 46.15 |
| Kumru-2B | 39.69 | 6.50 | 47.57 |
| Llama-3.1-8B-Instruct | 45.77 | 38.99 | 46.51 |
| Llama-3.3-70B-Instruct | 60.70 | 23.97 | 63.73 |
| meta-llama/Llama-3.2-11B-Vision-Instruct | 45.66 | 4.37 | 47.88 |
| meta-llama/Llama-3.2-3B-Instruct | 37.00 | 7.52 | 39.00 |
| Qwen/Qwen2-72B-Instruct | 61.27 | 0.83 | 60.47 |
| Qwen/Qwen2-7B-Instruct | 49.66 | 1.53 | 52.52 |
| Trendyol/Llama-3-Trendyol-LLM-8b-chat-v2.0 | 53.28 | 0.17 | 54.06 |
| Trendyol/Trendyol-LLM-7B-chat-v4.1.0 | 54.94 | 0.34 | 52.12 |
| ytu-ce-cosmos/Turkish-Gemma-9b-v0.1 | 51.85 | 11.11 | 46.97 |
| ytu-ce-cosmos/turkish-gpt2-large-750m-instruct-v0.1 | 35.20 | 0.28 | 52.77 |
### Notes
- QA = mean F1 over TQuAD and XQuAD-TR for this run.
- CETVEL includes more tasks (GEC, MT, NLI, summarization); this table compares only the shared Turkish subsets.
- For reproducibility, see the artifacts repo above, which records the exact command used; an illustrative sketch follows below.
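The authoritative command lives in the artifacts repo; purely as an illustrative sketch, a run over these four subsets with lm-evaluation-harness's Python API might look like the snippet below. The model repo id and task names are assumptions inferred from this card and may not match CETVEL's actual task registry.

```python
# Hypothetical reproduction sketch, not the exact command used for this run;
# see the artifacts repo above for the authoritative invocation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=nurcunal/BEDAI-2B",  # assumed model repo id
    # Task names inferred from the dataset names in this card (assumption).
    tasks=["exams_tr", "tquad", "xquad_tr", "turkish_plu"],
    batch_size=8,
)
print(results["results"])  # per-task metrics, e.g. acc_norm, exact_match, f1
```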