---
model-index:
- name: BEDAI-2B
  results:
  - task:
      type: multiple-choice
      name: Exams (TR)
    dataset:
      name: exams_tr
      type: exams_tr
      args: {split: validation}
    metrics:
    - name: accuracy_norm
      type: accuracy
      value: 25.70
  - task:
      type: question-answering-extractive
      name: TQuAD (TR)
    dataset:
      name: tquad
      type: tquad
      args: {split: validation}
    metrics:
    - name: exact_match
      type: exact_match
      value: 9.9807
    - name: f1
      type: f1
      value: 22.9314
  - task:
      type: question-answering-extractive
      name: XQuAD (TR)
    dataset:
      name: xquad_tr
      type: xquad_tr
      args: {split: validation}
    metrics:
    - name: exact_match
      type: exact_match
      value: 6.4706
    - name: f1
      type: f1
      value: 13.0114
  - task:
      type: text-classification
      name: Turkish PLU (overall)
    dataset:
      name: turkish_plu
      type: turkish_plu
      args: {split: test}
    metrics:
    - name: accuracy_norm
      type: accuracy
      value: 51.58
---

## Evaluation (CETVEL – Turkish subsets)

Raw artifacts: **[nurcunal/BEDAI-2B-cetvel-2025-10-31](https://huggingface.co/datasets/nurcunal/BEDAI-2B-cetvel-2025-10-31)**

This quick sweep covers **MCQA** (`exams_tr`, acc_norm), **QA** (mean F1 of `tquad` and `xquad_tr`), and **TC** (`turkish_plu`, acc_norm).

**BEDAI-2B (this run):** MCQA **25.70**, QA **17.97**, TC **51.58**
| Model | MCQA | QA | TC |
|---|---|---|---|
| BEDAI-2B (this work) | 25.70 | 17.97 | 51.58 |
| CohereLabs/aya-expanse-32b | 52.47 | 20.48 | 50.67 |
| CohereLabs/aya-expanse-8b | 44.09 | 0.19 | 50.03 |
| google/gemma-2-9b-it | 48.20 | 4.46 | 45.38 |
| google/gemma-3-12b-it | 52.66 | 10.26 | 54.38 |
| google/gemma-3-27b-it | 55.40 | 10.56 | 53.65 |
| google/gemma-3-4b-it | 42.33 | 8.22 | 46.15 |
| Kumru-2B | 39.69 | 6.50 | 47.57 |
| Llama-3.1-8B-Instruct | 45.77 | 38.99 | 46.51 |
| Llama-3.3-70B-Instruct | 60.70 | 23.97 | 63.73 |
| meta-llama/Llama-3.2-11B-Vision-Instruct | 45.66 | 4.37 | 47.88 |
| meta-llama/Llama-3.2-3B-Instruct | 37.00 | 7.52 | 39.00 |
| Qwen/Qwen2-72B-Instruct | 61.27 | 0.83 | 60.47 |
| Qwen/Qwen2-7B-Instruct | 49.66 | 1.53 | 52.52 |
| Trendyol/Llama-3-Trendyol-LLM-8b-chat-v2.0 | 53.28 | 0.17 | 54.06 |
| Trendyol/Trendyol-LLM-7B-chat-v4.1.0 | 54.94 | 0.34 | 52.12 |
| ytu-ce-cosmos/Turkish-Gemma-9b-v0.1 | 51.85 | 11.11 | 46.97 |
| ytu-ce-cosmos/turkish-gpt2-large-750m-instruct-v0.1 | 35.20 | 0.28 | 52.77 |
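
The QA column is an aggregate, not a raw metric: it averages the per-dataset F1 scores of `tquad` and `xquad_tr`. A minimal sketch of that aggregation, using the BEDAI-2B numbers from the model-index above:

```python
# QA column = unweighted mean of per-dataset F1 (tquad, xquad_tr).
# F1 values are taken from the BEDAI-2B model-index entries above.
f1_scores = {"tquad": 22.9314, "xquad_tr": 13.0114}

qa = sum(f1_scores.values()) / len(f1_scores)
print(f"QA = {qa:.2f}")  # 17.97, matching the table row for BEDAI-2B
```

The same unweighted mean applies to any model in the table; exact-match scores are reported in the model-index but do not enter the QA aggregate.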