Update README.md
**Why does this Matter?** Belebele tests an LLM's ability to answer questions based on a given text, a standard use case in retrieval-augmented generation workflows.

**What did we do?** We used the standard implementation of the [belebele](https://github.com/eleutherai/lm-evaluation-harness/tree/main/lm_eval/tasks/belebele) task from the LM Evaluation Harness. We set tokenisers to `use_fast=False`. We report **5-shot** accuracy.
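Under the hood, the harness treats each Belebele item as four-way multiple choice: it scores every candidate answer by its log-likelihood given the passage, question, and in-context examples, and counts the item correct when the gold answer scores highest. A minimal sketch of that scoring rule (the log-likelihood values are illustrative, not real model outputs):

```python
# Multiple-choice accuracy: pick the answer option with the highest
# log-likelihood, then compare against the gold index per question.

def accuracy(loglikelihoods, gold):
    """loglikelihoods: one list of option scores per question;
    gold: index of the correct option for each question."""
    correct = sum(
        1 for scores, g in zip(loglikelihoods, gold)
        if max(range(len(scores)), key=scores.__getitem__) == g
    )
    return correct / len(gold)

scores = [
    [-4.1, -2.3, -5.0, -6.2],  # model prefers option 1 (correct)
    [-3.0, -2.8, -2.5, -4.4],  # model prefers option 2 (gold is 0: wrong)
    [-1.9, -3.3, -4.7, -2.2],  # model prefers option 0 (correct)
]
gold = [1, 0, 0]
print(accuracy(scores, gold))  # 2 of 3 correct
```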
| 5-shot | Gemma 2 27b | ALIA 40b | EuroLLM Prev. 22b | TildeOpen 1.1 30b |
|----------|:-------------:|:----------:|:------------:|:-------------------:|
| Bulgarian | 79.8% | 78.8% | **85.3%** | 84.7% |
| Czech | 81.4% | 78.3% | 85.3% | **85.8%** |
| German | 81.2% | 80.6% | **85.0%** | 84.3% |
| English | **88.9%** | 83.0% | 87.6% | 88.3% |
| Estonian | 72.1% | 73.7% | 82.0% | **82.6%** |
| Finnish | 79.0% | 78.1% | 84.3% | **85.0%** |
| French | 82.6% | 80.1% | **85.7%** | 85.0% |
| Hungarian | 77.9% | 76.2% | 83.3% | **86.2%** |
| Icelandic | 70.8% | 58.2% | 54.3% | **85.7%** |
| Italian | 82.1% | 77.8% | 81.0% | **82.4%** |
| Lithuanian | 76.1% | 76.1% | **85.2%** | 83.3% |
| Latvian | 78.4% | 77.7% | **84.6%** | **84.6%** |
| Dutch | 80.2% | 78.9% | 83.2% | **85.0%** |
| Polish | 78.3% | 77.9% | 82.2% | **83.0%** |
| Portuguese | 83.8% | 80.1% | 86.1% | **87.1%** |
| Romanian | 80.3% | 78.8% | 85.3% | **85.9%** |
| Russian | 79.4% | 79.4% | 84.2% | **84.6%** |
| Slovak | 78.9% | 78.0% | 84.1% | **85.0%** |
| Slovenian | 78.0% | 80.0% | 83.7% | **85.1%** |
| Spanish | 82.1% | 78.4% | **84.1%** | 83.8% |
| Serbian | 79.8% | 78.4% | 74.1% | **84.2%** |
| Swedish | 80.6% | 76.3% | **85.3%** | 84.4% |
| Turkish | 77.4% | 62.3% | 79.9% | **82.7%** |
| Ukrainian | 78.0% | 77.0% | 83.9% | **85.1%** |
| **Average** | 79.5% | 76.8% | 82.5% | **84.7%** |
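As a quick arithmetic sanity check, the **Average** row can be recomputed from the 24 per-language scores; for the TildeOpen 1.1 30b column, for example:

```python
# Recompute the TildeOpen 1.1 30b column average from the table values.
tildeopen = [84.7, 85.8, 84.3, 88.3, 82.6, 85.0, 85.0, 86.2, 85.7, 82.4,
             83.3, 84.6, 85.0, 83.0, 87.1, 85.9, 84.6, 85.0, 85.1, 83.8,
             84.2, 84.4, 82.7, 85.1]
avg = sum(tildeopen) / len(tildeopen)
print(round(avg, 1))  # 84.7, matching the reported average
```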

## Per-Character Perplexity

**What is Perplexity?** Perplexity measures how well a language model predicts text. A model with low perplexity makes accurate predictions consistently, while a model with high perplexity is frequently "surprised" by unexpected words or patterns. Lower perplexity indicates the model has learned the language's patterns more effectively: it is less "surprised" by what it encounters because it better understands how the language works.
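Normalising per *character* rather than per token makes scores comparable across models with different tokenisers, since a model that splits text into fewer, longer tokens is not rewarded for granularity alone. A minimal sketch of the computation (toy natural-log token probabilities, not real model outputs):

```python
import math

def per_char_perplexity(token_logprobs, text):
    """Exponentiate the average negative log-probability per character,
    so the tokeniser's segmentation of the text cancels out."""
    total_logprob = sum(token_logprobs)  # natural-log prob of each token
    return math.exp(-total_logprob / len(text))

# Toy example: 4 tokens covering a 12-character string.
text = "hello world!"
token_logprobs = [-1.2, -0.7, -2.1, -0.4]  # illustrative values
print(per_char_perplexity(token_logprobs, text))
```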