TildeSIA committed · verified · Commit 7f9e43a · Parent(s): 8d485db

Update README.md

Files changed (1): README.md (+30 -29)
README.md CHANGED
@@ -110,35 +110,36 @@ Results
 
 **Why does this Matter?** Belebele tests an LLM's ability to provide answers based on a given text -- a standard use case in retrieval augmented generation workflows.
 
-**What did we do?** We used the standard implementation of the [Belebele](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/belebele) task from the LLM Evaluation Harness. We set tokenisers to ```use_fast=False```.
-
-| Language | Gemma 2 27b | ALIA 40b | EuroLLM 9b | EuroLLM Prev. 22b | TildeOpen 1.1 30b |
-|----------|-------------|----------|------------|-------------------|-------------------|
-| Bulgarian | 79.8% | 78.8% | 74.2% | **85.3%** | 84.7% |
-| Czech | 81.4% | 78.3% | 74.9% | 85.3% | **85.8%** |
-| German | 81.2% | 80.6% | 75.1% | **85.0%** | 84.3% |
-| English | 88.9% | 83.0% | 77.3% | 87.6% | **88.3%** |
-| Estonian | 72.1% | 73.7% | 70.8% | 82.0% | **82.6%** |
-| Finnish | 79.0% | 78.1% | 73.3% | 84.3% | **85.0%** |
-| French | 82.6% | 80.1% | 77.7% | **85.7%** | 85.0% |
-| Hungarian | 77.9% | 76.2% | 72.9% | 83.3% | **86.2%** |
-| Icelandic | 70.8% | 58.2% | 44.6% | 54.3% | **85.7%** |
-| Italian | 82.1% | 77.8% | 74.7% | 81.0% | **82.4%** |
-| Lithuanian | 76.1% | 76.1% | 72.8% | **85.2%** | 83.3% |
-| Latvian | 78.4% | 77.7% | 73.6% | **84.6%** | **84.6%** |
-| Dutch | 80.2% | 78.9% | 73.0% | 83.2% | **85.0%** |
-| Polish | 78.3% | 77.9% | 73.2% | 82.2% | **83.0%** |
-| Portuguese | 83.8% | 80.1% | 73.9% | 86.1% | **87.1%** |
-| Romanian | 80.3% | 78.8% | 75.1% | 85.3% | **85.9%** |
-| Russian | 79.4% | 79.4% | 73.1% | 84.2% | **84.6%** |
-| Slovak | 78.9% | 78.0% | 74.0% | 84.1% | **85.0%** |
-| Slovenian | 78.0% | 80.0% | 72.6% | 83.7% | **85.1%** |
-| Spanish | 82.1% | 78.4% | 73.6% | **84.1%** | 83.8% |
-| Serbian | 79.8% | 78.4% | 66.3% | 74.1% | **84.2%** |
-| Swedish | 80.6% | 76.3% | 73.4% | **85.3%** | 84.4% |
-| Turkish | 77.4% | 62.3% | 70.0% | 79.9% | **82.7%** |
-| Ukrainian | 78.0% | 77.0% | 71.9% | 83.9% | **85.1%** |
-| **Average** | 79.5% | 76.8% | 72.2% | 82.5% | **84.7%** |
+**What did we do?** We used the standard implementation of the [Belebele](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/belebele) task from the LLM Evaluation Harness. We set tokenisers to ```use_fast=False```. We report **5-shot** accuracy.
+
+| 5-shot | Gemma 2 27b | ALIA 40b | EuroLLM Prev. 22b | TildeOpen 1.1 30b |
+|----------|:-------------:|:----------:|:------------:|:-------------------:|
+| Bulgarian | 79.8% | 78.8% | **85.3%** | 84.7% |
+| Czech | 81.4% | 78.3% | 85.3% | **85.8%** |
+| German | 81.2% | 80.6% | **85.0%** | 84.3% |
+| English | **88.9%** | 83.0% | 87.6% | 88.3% |
+| Estonian | 72.1% | 73.7% | 82.0% | **82.6%** |
+| Finnish | 79.0% | 78.1% | 84.3% | **85.0%** |
+| French | 82.6% | 80.1% | **85.7%** | 85.0% |
+| Hungarian | 77.9% | 76.2% | 83.3% | **86.2%** |
+| Icelandic | 70.8% | 58.2% | 54.3% | **85.7%** |
+| Italian | 82.1% | 77.8% | 81.0% | **82.4%** |
+| Lithuanian | 76.1% | 76.1% | **85.2%** | 83.3% |
+| Latvian | 78.4% | 77.7% | **84.6%** | **84.6%** |
+| Dutch | 80.2% | 78.9% | 83.2% | **85.0%** |
+| Polish | 78.3% | 77.9% | 82.2% | **83.0%** |
+| Portuguese | 83.8% | 80.1% | 86.1% | **87.1%** |
+| Romanian | 80.3% | 78.8% | 85.3% | **85.9%** |
+| Russian | 79.4% | 79.4% | 84.2% | **84.6%** |
+| Slovak | 78.9% | 78.0% | 84.1% | **85.0%** |
+| Slovenian | 78.0% | 80.0% | 83.7% | **85.1%** |
+| Spanish | 82.1% | 78.4% | **84.1%** | 83.8% |
+| Serbian | 79.8% | 78.4% | 74.1% | **84.2%** |
+| Swedish | 80.6% | 76.3% | **85.3%** | 84.4% |
+| Turkish | 77.4% | 62.3% | 79.9% | **82.7%** |
+| Ukrainian | 78.0% | 77.0% | 83.9% | **85.1%** |
+| **Average** | 79.5% | 76.8% | 82.5% | **84.7%** |
+
 
 ## Per-Character Perplexity
 **What is Perplexity?** Perplexity measures how well a language model predicts text. A model with low perplexity makes accurate predictions consistently, while a high perplexity means the model is frequently "surprised" by unexpected words or patterns. Lower perplexity indicates the model has learned language patterns more effectively. It's less "surprised" by what it encounters because it better understands how the language works.
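The "per-character" part of the metric can be sketched in a few lines. This is an illustrative helper (the function name and inputs are our own, not part of the model card's evaluation code): summing a text's per-token negative log-likelihoods and dividing by the character count before exponentiating makes models with different tokenisers comparable, since a tokeniser that splits the same text into fewer tokens still pays the same total surprise over the same characters.

```python
import math

def per_char_perplexity(token_nlls, text):
    """Illustrative sketch: convert per-token negative log-likelihoods
    (in nats, e.g. collected from a causal LM's loss) into perplexity
    normalised by character count rather than token count."""
    total_nll = sum(token_nlls)             # total surprise over the text
    return math.exp(total_nll / len(text))  # divide by characters, not tokens
```

For example, two tokens with 2.0 nats of loss each over a four-character string give a per-character perplexity of `exp(4.0 / 4) = e`, regardless of how the tokeniser segmented those characters.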