model-index:
- name: BEDAI-2B
  results:
  - task:
      type: multiple-choice
      name: Exams (TR)
    dataset:
      name: exams_tr
      type: exams_tr
      args: {split: validation}
    metrics:
    - name: accuracy_norm
      type: accuracy
      value: 25.70
  - task:
      type: question-answering-extractive
      name: TQuAD (TR)
    dataset:
      name: tquad
      type: tquad
      args: {split: validation}
    metrics:
    - name: exact_match
      type: exact_match
      value: 9.9807
    - name: f1
      type: f1
      value: 22.9314
  - task:
      type: question-answering-extractive
      name: XQuAD (TR)
    dataset:
      name: xquad_tr
      type: xquad_tr
      args: {split: validation}
    metrics:
    - name: exact_match
      type: exact_match
      value: 6.4706
    - name: f1
      type: f1
      value: 13.0114
  - task:
      type: text-classification
      name: Turkish PLU (overall)
    dataset:
      name: turkish_plu
      type: turkish_plu
      args: {split: test}
    metrics:
    - name: accuracy_norm
      type: accuracy
      value: 51.58
## Evaluation (CETVEL – Turkish subsets)

Raw artifacts: **[nurcunal/BEDAI-2B-cetvel-2025-10-31](https://huggingface.co/datasets/nurcunal/BEDAI-2B-cetvel-2025-10-31)**

This quick sweep covers **MCQA** (`exams_tr`, acc_norm), **QA** (mean F1 over `tquad` and `xquad_tr`), and **TC** (`turkish_plu`, acc_norm).

**BEDAI-2B (this run):** MCQA **25.70**, QA **17.97**, TC **51.58**
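The three headline numbers are derived from the per-task metrics reported in the model-index block above. A minimal sketch of that aggregation (values copied from the metadata; the two-decimal rounding is an assumption about how the summary was produced):

```python
# Aggregate the per-task CETVEL metrics into the three headline scores.
# Values are taken from the model-index block above; the 2-decimal rounding
# is an assumption about how the summary numbers were produced.
exams_tr_acc_norm = 25.70
tquad_f1 = 22.9314
xquad_tr_f1 = 13.0114
turkish_plu_acc_norm = 51.58

mcqa = exams_tr_acc_norm            # single MCQA task
qa = (tquad_f1 + xquad_tr_f1) / 2   # mean F1 over TQuAD and XQuAD-TR
tc = turkish_plu_acc_norm           # single TC task

print(f"MCQA {mcqa:.2f}  QA {qa:.2f}  TC {tc:.2f}")
# -> MCQA 25.70  QA 17.97  TC 51.58
```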
<table>
<thead>
<tr><th style="text-align:left">Model</th><th>MCQA</th><th>QA</th><th>TC</th></tr>
</thead>
<tbody>
<tr><th style="text-align:left">BEDAI-2B (this work)</th>
  <td style="background:#f4cccc">25.70</td>
  <td style="background:#f8cbad">17.97</td>
  <td style="background:#ffeb9c">51.58</td></tr>
<tr><th style="text-align:left">CohereLabs__aya-expanse-32b</th>
  <td style="background:#ffeb9c">52.47</td>
  <td style="background:#f8cbad">20.48</td>
  <td style="background:#ffeb9c">50.67</td></tr>
<tr><th style="text-align:left">CohereLabs__aya-expanse-8b</th>
  <td style="background:#f8cbad">44.09</td>
  <td style="background:#f4cccc">0.19</td>
  <td style="background:#ffeb9c">50.03</td></tr>
<tr><th style="text-align:left">google__gemma-2-9b-it</th>
  <td style="background:#ffeb9c">48.20</td>
  <td style="background:#f4cccc">4.46</td>
  <td style="background:#f8cbad">45.38</td></tr>
<tr><th style="text-align:left">google__gemma-3-12b-it</th>
  <td style="background:#ffeb9c">52.66</td>
  <td style="background:#f4cccc">10.26</td>
  <td style="background:#ffeb9c">54.38</td></tr>
<tr><th style="text-align:left">google__gemma-3-27b-it</th>
  <td style="background:#c6efce">55.40</td>
  <td style="background:#f4cccc">10.56</td>
  <td style="background:#ffeb9c">53.65</td></tr>
<tr><th style="text-align:left">google__gemma-3-4b-it</th>
  <td style="background:#f8cbad">42.33</td>
  <td style="background:#f4cccc">8.22</td>
  <td style="background:#f8cbad">46.15</td></tr>
<tr><th style="text-align:left">Kumru-2B</th>
  <td style="background:#f8cbad">39.69</td>
  <td style="background:#f4cccc">6.50</td>
  <td style="background:#ffeb9c">47.57</td></tr>
<tr><th style="text-align:left">Llama-3.1-8B-Instruct</th>
  <td style="background:#ffeb9c">45.77</td>
  <td style="background:#c6efce">38.99</td>
  <td style="background:#f8cbad">46.51</td></tr>
<tr><th style="text-align:left">Llama-3.3-70B-Instruct</th>
  <td style="background:#c6efce">60.70</td>
  <td style="background:#ffeb9c">23.97</td>
  <td style="background:#c6efce">63.73</td></tr>
<tr><th style="text-align:left">meta-llama__Llama-3.2-11B-Vision-Instruct</th>
  <td style="background:#ffeb9c">45.66</td>
  <td style="background:#f4cccc">4.37</td>
  <td style="background:#f8cbad">47.88</td></tr>
<tr><th style="text-align:left">meta-llama__Llama-3.2-3B-Instruct</th>
  <td style="background:#f8cbad">37.00</td>
  <td style="background:#f4cccc">7.52</td>
  <td style="background:#f4cccc">39.00</td></tr>
<tr><th style="text-align:left">Qwen__Qwen2-72B-Instruct</th>
  <td style="background:#c6efce">61.27</td>
  <td style="background:#f4cccc">0.83</td>
  <td style="background:#c6efce">60.47</td></tr>
<tr><th style="text-align:left">Qwen__Qwen2-7B-Instruct</th>
  <td style="background:#ffeb9c">49.66</td>
  <td style="background:#f4cccc">1.53</td>
  <td style="background:#ffeb9c">52.52</td></tr>
<tr><th style="text-align:left">Trendyol__Llama-3-Trendyol-LLM-8b-chat-v2.0</th>
  <td style="background:#c6efce">53.28</td>
  <td style="background:#f4cccc">0.17</td>
  <td style="background:#c6efce">54.06</td></tr>
<tr><th style="text-align:left">Trendyol__Trendyol-LLM-7B-chat-v4.1.0</th>
  <td style="background:#c6efce">54.94</td>
  <td style="background:#f4cccc">0.34</td>
  <td style="background:#ffeb9c">52.12</td></tr>
<tr><th style="text-align:left">ytu-ce-cosmos__Turkish-Gemma-9b-v0.1</th>
  <td style="background:#ffeb9c">51.85</td>
  <td style="background:#f4cccc">11.11</td>
  <td style="background:#f8cbad">46.97</td></tr>
<tr><th style="text-align:left">ytu-ce-cosmos__turkish-gpt2-large-750m-instruct-v0.1</th>
  <td style="background:#f8cbad">35.20</td>
  <td style="background:#f4cccc">0.28</td>
  <td style="background:#ffeb9c">52.77</td></tr>
</tbody>
</table>
> **Notes**
> • QA = mean F1 over **TQuAD** and **XQuAD-TR** for this run.
> • All scores are on a 0–100 scale; higher is better.
> • CETVEL includes more tasks (GEC/MT/NLI/SUM); this table compares only the shared Turkish subsets.
> • For reproducibility, see the dataset repo above, which documents the exact command used.
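For orientation only, below is a hedged sketch of how such a sweep could be launched through the `lm-evaluation-harness` Python API. It is not the exact command used for this run; the model repo id and the Turkish task names are assumptions carried over from the tables above, and may differ from how CETVEL registers them.

```python
# Illustrative sketch only, NOT the exact command used for this run.
# Assumes (a) the CETVEL Turkish subsets are registered as lm-evaluation-harness
# tasks under the names shown above, and (b) the model lives at
# "nurcunal/BEDAI-2B"; both are assumptions, not facts from this card.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=nurcunal/BEDAI-2B,dtype=bfloat16",
    tasks=["exams_tr", "tquad", "xquad_tr", "turkish_plu"],
    batch_size=8,
)

# Per-task metrics (acc_norm, exact_match, f1) land under results["results"].
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```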