dtamayo committed on
Commit 71a9d35
0 Parent(s)

Push MrBERT
.gitattributes ADDED
@@ -0,0 +1,3 @@
1
+ *.bin filter=lfs diff=lfs merge=lfs -text
2
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
3
+ *.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,393 @@
1
+ ---
2
+ language:
3
+ - multilingual
4
+ - bg
5
+ - ca
6
+ - code
7
+ - cs
8
+ - cy
9
+ - da
10
+ - de
11
+ - el
12
+ - en
13
+ - es
14
+ - et
15
+ - eu
16
+ - fi
17
+ - fr
18
+ - ga
19
+ - gl
20
+ - hr
21
+ - hu
22
+ - it
23
+ - lt
24
+ - lv
25
+ - mt
26
+ - nl
27
+ - nn
28
+ - "no"
29
+ - oc
30
+ - pl
31
+ - pt
32
+ - ro
33
+ - ru
34
+ - sh
35
+ - sk
36
+ - sl
37
+ - sr
38
+ - sv
39
+ - uk
40
+ tags:
41
+ - fill-mask
42
+ - masked-lm
43
+ - long-context
44
+ - modernbert
45
+ license: apache-2.0
46
+ library_name: transformers
47
+ ---
48
+
49
+ # MrBERT Model Card
50
+
51
+ MrBERT is a new **multilingual foundational model** based on the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) architecture. It has been pretrained from scratch using 35 European languages and code. The pretraining corpus consists of 6.1T tokens.
52
+
53
+ ## Technical Description
54
+
55
+ Technical details of the MrBERT model.
56
+
57
+ | Description | Value |
58
+ |-------------------------|:--------------|
59
+ | Model Parameters | 308M |
60
+ | Tokenizer Type | SPM |
61
+ | Vocabulary size | 256,000 |
62
+ | Precision | bfloat16 |
63
+ | Context length | 8192 |
64
+
65
+
66
+ Training Hyperparameters
67
+
68
+ | Hyperparameter | Value |
69
+ |------------------------- |:-------------- |
70
+ | Pretraining Objective | Masked Language Modeling |
71
+ | Learning Rate | 1E-03 |
72
+ | Learning Rate Scheduler | WSD |
73
+ | Warmup | 50,000,000,000 tokens |
74
+ | Optimizer | decoupled_stableadamw |
75
+ | Optimizer Hyperparameters | AdamW (β1=0.9, β2=0.98, ε=1e-06) |
76
+ | Weight Decay | 1E-05 |
77
+ | Global Batch Size | 4096 |
78
+ | Dropout | 1E-01 |
79
+ | Activation Function | GeLU |
80
+
81
+ ## How to use
82
+
83
+ You can use the pipeline for masked language modeling:
84
+ ```python
85
+ >>> from transformers import pipeline
86
+ >>> from pprint import pprint
87
+ >>> unmasker = pipeline('fill-mask', model='BSC-LT/MrBERT')
88
+
89
+ >>> pprint(unmasker("I love the<mask>of Barcelona.",top_k=3))
90
+ [{'score': 0.29333314299583435,
91
+ 'sequence': 'I love the city of Barcelona.',
92
+ 'token': 31489,
93
+ 'token_str': 'city'},
94
+ {'score': 0.06682543456554413,
95
+ 'sequence': 'I love the capital of Barcelona.',
96
+ 'token': 10859,
97
+ 'token_str': 'capital'},
98
+ {'score': 0.05594080686569214,
99
+ 'sequence': 'I love the streets of Barcelona.',
100
+ 'token': 178738,
101
+ 'token_str': 'streets'}]
102
+ >>> pprint(unmasker("Me encanta la<mask>de Barcelona.",top_k=3))
103
+ [{'score': 0.4422685205936432,
104
+ 'sequence': 'Me encanta la ciudad de Barcelona.',
105
+ 'token': 19587,
106
+ 'token_str': 'ciudad'},
107
+ {'score': 0.059732843190431595,
108
+ 'sequence': 'Me encanta la capital de Barcelona.',
109
+ 'token': 10859,
110
+ 'token_str': 'capital'},
111
+ {'score': 0.03484857454895973,
112
+ 'sequence': 'Me encanta la arquitectura de Barcelona.',
113
+ 'token': 83374,
114
+ 'token_str': 'arquitectura'}]
115
+ >>> pprint(unmasker("M'encanta la<mask>de Barcelona.",top_k=3))
116
+ [{'score': 0.45476993918418884,
117
+ 'sequence': "M'encanta la ciutat de Barcelona.",
118
+ 'token': 17128,
119
+ 'token_str': 'ciutat'},
120
+ {'score': 0.05597861483693123,
121
+ 'sequence': "M'encanta la capital de Barcelona.",
122
+ 'token': 10859,
123
+ 'token_str': 'capital'},
124
+ {'score': 0.04105329513549805,
125
+ 'sequence': "M'encanta la música de Barcelona.",
126
+ 'token': 16051,
127
+ 'token_str': 'música'}]
128
+ ```
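
The `score` fields above are softmax probabilities computed over the vocabulary logits at the masked position. A minimal sketch of that normalization, with made-up logits rather than actual model outputs:

```python
import math

def softmax(logits):
    # Numerically stable softmax: shift by the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens at the <mask> position.
probs = softmax([5.0, 3.5, 3.3])
top_k = sorted(range(len(probs)), key=lambda i: -probs[i])  # indices, best first
```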
129
+
130
+ Alternatively, you can also extract the logits associated with the sequences and perform the calculations by hand:
131
+
132
+ ```python
133
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
134
+ import torch
135
+
136
+ model = AutoModelForMaskedLM.from_pretrained("BSC-LT/MrBERT")
137
+ tokenizer = AutoTokenizer.from_pretrained("BSC-LT/MrBERT")
138
+
139
+ outputs = model(**tokenizer("The capital of Spain is<mask>", return_tensors="pt")).logits
140
+
141
+ # The index of "<mask>" token is -2 given that the -1 position is the EOS token "</s>".
142
+ predicted_token = tokenizer.decode(torch.argmax(outputs[0,-2,:]))
143
+
144
+ print(f"The decoded element is \"{predicted_token}\".")  # This will give "Madrid"
145
+ ```
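
Hard-coding position `-2` works for this single-sentence example, but it is more robust to look up the mask token id directly, which also handles inputs with more than one mask. A minimal sketch with illustrative token ids (`6` is the `<mask>` id according to this repo's tokenizer_config.json; the other ids are made up):

```python
MASK_TOKEN_ID = 6  # "<mask>" per this repo's tokenizer_config.json

def mask_positions(input_ids):
    # Return every index holding the mask token, so the lookup also
    # works when a sequence contains several <mask> tokens.
    return [i for i, tok in enumerate(input_ids) if tok == MASK_TOKEN_ID]

# Illustrative sequence: <s> ... <mask> </s>
example_ids = [1, 120, 453, 9087, 22, 6, 2]
print(mask_positions(example_ids))  # [5]
```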
146
+
147
+ # Data
148
+
149
+ ### Pretraining Corpus
150
+
151
+ The pretraining corpus comprises **6.1 trillion tokens**, covering **35 European languages** and **92 programming languages**. Training was conducted in three distinct phases to balance broad knowledge acquisition with long-context performance:
152
+
153
+ | Phase | Context Length | Token Count |
154
+ | :--- | :--- | :--- |
155
+ | **Short Context** | 1,024 tokens | 5.5T |
156
+ | **Long Context** | 8,192 tokens | 500B |
157
+ | **Annealing** | 8,192 tokens | 100B |
158
+
159
+ #### Language Distribution & Sources
160
+ While English constitutes **67.4%** of the total data, the remaining distribution spans a diverse set of European languages as shown below:
161
+
162
+ <img src="./images/lang_distribution.png" alt="Language distribution of the pretraining corpus"/>
163
+
164
+ The dataset is mainly composed of a 50/50 split between:
165
+ * **The Salamandra Corpus:** Data used for the [Salamandra Family](https://huggingface.co/collections/BSC-LT/salamandra).
166
+ * **High-Quality Web Data:** Extracted from **FineWeb-v2**, **FineWeb-Edu**, and an updated **Wikipedia** dump.
167
+
168
+ # Multilingual Evaluation and Performance
169
+
170
+ We evaluate the models on multilingual benchmarks to assess their multilingual capabilities.
171
+
172
+ The following multilingual benchmarks have been considered:
173
+
174
+ | Benchmark | Description | Languages | Source |
175
+ |------------------|-------------|-----------|--------------|
176
+ | XTREME| Benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models | bg,ca,de,el,en,es,et,eu,fi,fr,hu,it,lt,nl,pl,pt,ro,ru,uk | [LINK](https://github.com/google-research/xtreme) |
177
+ | EvalES | Spanish benchmark for evaluating language understanding across multiple NLP tasks | es | [LINK](https://plantl-gob-es.github.io/spanish-benchmark) |
178
+ | CLUB | Human-Annotated Catalan Benchmark | ca | [LINK](https://github.com/projecte-aina/club) |
179
+ | mTEB Subset | Multilingual Text Embedding Benchmark for evaluating sentence and document embedding quality across retrieval tasks using ColBERT | en, es | [LINK](https://github.com/embeddings-benchmark/mteb) |
180
+
181
+
182
+ The following base foundational models have been considered:
183
+
184
+
185
+ | Multilingual Foundational Model | Number of Parameters | Vocab Size | Description |
186
+ |---------------------------------|----------------------|------------|-------------|
187
+ | [mRoBERTa](https://huggingface.co/BSC-LT/mRoBERTa) | 283M | 256K | RoBERTa base model pretrained with 35 European languages and a larger vocabulary size. |
188
+ | [RoBERTa-ca](https://huggingface.co/BSC-LT/RoBERTa-ca) | 125M | 50K | RoBERTa-ca is a Catalan-specific language model obtained by using vocabulary adaptation from mRoBERTa. |
189
+ | [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) | 279M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data containing 100 languages. |
190
+ | [mmBERT](https://huggingface.co/jhu-clsp/mmBERT-base) | 308M | 250K | Multilingual ModernBERT pre-trained with staged language learning. |
191
+
192
+
193
+ ## RESULTS
194
+
195
+ This section presents results across various multilingual benchmarks, with the best value in each row highlighted in bold and the second best underlined.
196
+
197
+ ### XTREME Benchmark
198
+
199
+ The **Cross-lingual TRansfer Evaluation of Multilingual Encoders ([XTREME](https://github.com/google-research/xtreme))** benchmark is designed to assess the cross-lingual generalization capabilities of pre-trained multilingual models. It comprises nine tasks that collectively test reasoning across various levels of syntax and semantics. The results reported here are from the test set, using the learning rate chosen based on the best-performing model on the validation set.
200
+
201
+ Given that retrieval tasks generally achieve higher performance with multi-vector methods such as ColBERT, we evaluate these tasks separately using mTEB.
202
+
203
+ The table below shows only the average results across languages:
204
+
205
+ | task | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) |
206
+ | --- | --- | --- | --- | --- |
207
+ | xnli (classification) | 78.25 | 79.09 | <u>80.54</u> | **81.26** |
208
+ | pawsx (classification) | 89.50 | 90.36 | **92.34** | <u>90.90</u> |
209
+ | udpos (POS) | **85.55** | <u>85.36</u> | 84.33 | 83.30 |
210
+ | panx (NER) | 73.69 | **75.65** | <u>73.89</u> | 71.11 |
211
+ | tydiqa (QA) | <u>56.41</u> | 53.96 | **63.95** | 56.34 |
212
+ | mlqa (QA) | 68.91 | 68.67 | **71.48** | <u>70.67</u> |
213
+ | xquad (QA) | 75.61 | 75.45 | <u>77.79</u> | **77.91** |
214
+
215
+ For per-language results, the full tables are provided below:
216
+
217
+ #### 🔵 Sentence Classification
218
+
219
+ ##### 🔵 XNLI
220
+ Metric used: Accuracy.
221
+
222
+ | langs | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) |
223
+ | --- | --- | --- | --- | --- |
224
+ | bg | 77.60 | 78.20 | <u>79.24</u> | **80.04** |
225
+ | de | 77.05 | <u>77.68</u> | **80.16** | **80.16** |
226
+ | el | 75.63 | 76.53 | <u>77.17</u> | **79.74** |
227
+ | en | 84.49 | 85.63 | <u>86.73</u> | **88.02** |
228
+ | es | 79.00 | 80.32 | <u>81.88</u> | **82.53** |
229
+ | fr | 78.04 | 78.94 | <u>80.76</u> | **80.78** |
230
+ | ru | 75.91 | 76.31 | **77.88** | <u>77.56</u> |
231
+
232
+ ##### 🔵 PAWS-X
233
+ Metric used: Accuracy.
234
+
235
+ | langs | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) |
236
+ | --- | --- | --- | --- | --- |
237
+ | de | 87.25 | <u>88.50</u> | **90.95** | 88.00 |
238
+ | en | 94.10 | <u>94.50</u> | **95.50** | **95.50** |
239
+ | es | 87.90 | 88.95 | **91.40** | <u>89.65</u> |
240
+ | fr | 88.75 | 89.50 | **91.50** | <u>90.45</u> |
241
+
242
+ #### 🟣 Structured Prediction: POS
243
+
244
+ ##### 🟣 POS (UDPOS)
245
+ Metric used: F1.
246
+
247
+ | langs | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) |
248
+ | --- | --- | --- | --- | --- |
249
+ | bg | **88.31** | <u>88.10</u> | 87.48 | 86.92 |
250
+ | de | **88.35** | <u>87.54</u> | 86.94 | 86.31 |
251
+ | el | **87.52** | 83.90 | 81.80 | <u>83.92</u> |
252
+ | en | <u>95.89</u> | 95.80 | **95.96** | 95.86 |
253
+ | es | 87.49 | 87.45 | **87.70** | <u>87.62</u> |
254
+ | et | <u>84.60</u> | **85.55** | 81.08 | 78.00 |
255
+ | eu | <u>66.98</u> | **67.43** | 64.67 | 63.71 |
256
+ | fi | **84.77** | <u>83.58</u> | 82.38 | 77.29 |
257
+ | fr | 85.93 | **86.84** | <u>86.55</u> | 84.37 |
258
+ | hu | **83.12** | <u>82.15</u> | 80.00 | 79.85 |
259
+ | it | 86.95 | <u>89.02</u> | **89.14** | 86.49 |
260
+ | lt | **83.10** | 81.03 | <u>82.00</u> | 79.77 |
261
+ | nl | 89.16 | **89.36** | <u>89.17</u> | 88.72 |
262
+ | pl | <u>83.90</u> | **84.33** | 82.80 | 83.86 |
263
+ | pt | 86.75 | **87.78** | <u>87.35</u> | 86.97 |
264
+ | ro | **83.47** | 81.39 | <u>82.54</u> | 77.35 |
265
+ | ru | <u>88.92</u> | **89.59** | 87.84 | 88.49 |
266
+ | uk | <u>84.72</u> | **85.55** | 82.54 | 83.94 |
267
+
268
+ ##### 🟣 NER (PANX)
269
+ Metric used: F1.
270
+
271
+ | langs | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) |
272
+ | --- | --- | --- | --- | --- |
273
+ | bg | <u>76.42</u> | **77.92** | 74.29 | 72.65 |
274
+ | de | 74.60 | **78.36** | <u>75.98</u> | 75.75 |
275
+ | el | <u>73.32</u> | **76.31** | 68.62 | 68.67 |
276
+ | en | 82.39 | **83.43** | <u>82.82</u> | 80.91 |
277
+ | es | 71.04 | **81.19** | <u>79.93</u> | 77.25 |
278
+ | et | <u>71.85</u> | **72.87** | 70.32 | 69.96 |
279
+ | eu | <u>59.58</u> | 56.52 | 51.70 | **60.37** |
280
+ | fi | <u>75.80</u> | **76.07** | 74.67 | 71.64 |
281
+ | fr | 77.61 | <u>77.65</u> | **81.15** | 73.70 |
282
+ | hu | **76.66** | <u>73.42</u> | 73.03 | 72.57 |
283
+ | it | 76.73 | <u>80.02</u> | **80.49** | 76.22 |
284
+ | lt | <u>72.04</u> | **73.52** | 70.14 | 67.46 |
285
+ | nl | 79.85 | **81.91** | <u>80.10</u> | 78.65 |
286
+ | pl | 77.45 | **80.33** | <u>77.81</u> | 76.83 |
287
+ | pt | 75.93 | <u>79.21</u> | **80.36** | 75.29 |
288
+ | ro | <u>71.78</u> | 71.49 | **75.37** | 62.81 |
289
+ | ru | 63.39 | **67.31** | <u>64.13</u> | 57.96 |
290
+ | uk | <u>70.00</u> | **74.20** | 69.17 | 61.24 |
291
+
292
+ #### ⚫ Question Answering
293
+
294
+
295
+ ##### ⚫ TyDiQA
296
+ Metric used: F1.
297
+
298
+ | langs | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) |
299
+ | --- | --- | --- | --- | --- |
300
+ | en | 62.42 | 60.39 | **71.73** | <u>68.50</u> |
301
+ | fi | <u>54.19</u> | 49.51 | **60.89** | 48.08 |
302
+ | ru | <u>52.61</u> | 51.98 | **59.24** | 52.45 |
303
+
304
+ ##### ⚫ MLQA
305
+ Metric used: F1.
306
+
307
+ | langs | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) |
308
+ | --- | --- | --- | --- | --- |
309
+ | de | 61.57 | 61.49 | **65.19** | <u>64.38</u> |
310
+ | en | <u>78.68</u> | 76.37 | **79.09** | 77.45 |
311
+ | es | 66.48 | 68.15 | <u>70.17</u> | **70.17** |
312
+
313
+ ##### ⚫ XQUAD
314
+ Metric used: F1.
315
+
316
+ | langs | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) |
317
+ | --- | --- | --- | --- | --- |
318
+ | de | 74.31 | 72.93 | **77.68** | <u>75.94</u> |
319
+ | el | 71.68 | <u>71.70</u> | 68.81 | **74.92** |
320
+ | en | 82.26 | 82.37 | **85.67** | <u>84.44</u> |
321
+ | es | 75.20 | 78.18 | **79.58** | <u>79.23</u> |
322
+ | ru | 74.58 | 72.08 | **77.20** | <u>75.02</u> |
323
+
324
+ ### EvalES Benchmark
325
+
326
+ The [EvalES benchmark](https://benchmark.plantl.bsc.es/datasets.html) consists of 7 tasks: Named Entity Recognition and Classification (CoNLL-NERC), Part-of-Speech Tagging (UD-POS), Text Classification (MLDoc), Paraphrase Identification (PAWS-X), Semantic Textual Similarity (STS), Question Answering (SQAC), and Textual Entailment (XNLI). This benchmark evaluates the model's capabilities in the Spanish language.
327
+
328
+ | tasks | xlm-roberta-base (279M) | mRoBERTa (283M) | mmBERT (308M) | MrBERT (308M) | MrBERT-es (150M) |
329
+ |:-------------|--------------------------:|:------------------|:----------------|:----------------|:-------------------|
330
+ | pos (f1) | 99.01 | 99.03 | **99.09** | 99.06 | 99.04 |
331
+ | ner (f1) | 86.91 | **87.77** | 87.01 | 87.42 | 87.36 |
332
+ | sts (person) | 80.88 | 79.69 | 82.88 | 84.18 | **85.18** |
333
+ | tc (acc) | 90.35 | 91.30 | 91.35 | 91.25 | **91.60** |
334
+ | tc2 (acc) | 47.67 | 91.28 | 95.10 | 95.28 | **95.35** |
335
+ | tc3 (acc) | 21.89 | 86.45 | 86.79 | **87.46** | 87.19 |
336
+ | qa (f1) | 74.48 | 77.03 | 79.79 | **81.96** | 80.33 |
337
+ | te (acc) | 33.33 | 33.33 | 79.98 | **84.69** | 82.14 |
338
+
339
+ ### CLUB Benchmark
340
+
341
+ The [Catalan Language Understanding Benchmark](https://club.aina.bsc.es/datasets.html) consists of 6 tasks: Named Entity Recognition (NER), Part-of-Speech Tagging (POS), Semantic Textual Similarity (STS), Text Classification (TC), Textual Entailment (TE), and Question Answering (QA). This benchmark evaluates the model's capabilities in the Catalan language.
342
+
343
+ This comparison also includes RoBERTa-ca, a model derived from mRoBERTa by applying vocabulary adaptation and performing continual pre-training on a 95GB Catalan-only corpus. For further details, visit [here](https://huggingface.co/BSC-LT/RoBERTa-ca).
344
+
345
+ | tasks | xlm-roberta-base (279M) | mRoBERTa (283M) | roberta-ca (125M) | mmBERT (308M) | MrBERT (308M) |
346
+ |:---------------|--------------------------:|------------------:|:--------------------|:----------------|:----------------|
347
+ | ner (F1) | 87.61 | 88.33 | **89.70** | 88.14 | 87.32 |
348
+ | pos (F1) | 98.91 | 98.98 | 99.00 | **99.01** | **99.01** |
349
+ | sts (Person) | 74.67 | 79.52 | 82.99 | **83.16** | 83.00 |
350
+ | tc (Acc.) | 72.57 | 72.41 | 72.81 | **74.11** | 73.79 |
351
+ | te (Acc.) | 79.59 | 82.38 | 82.14 | 83.18 | **84.03** |
352
+ | viquiquad (F1) | 86.93 | 87.86 | 87.31 | **89.86** | 89.25 |
353
+ | xquad (F1) | 69.69 | 69.40 | 70.53 | 73.88 | **73.96** |
354
+
355
+
356
+ ## Additional information
357
+
358
+ ### Author
359
+ The Language Technologies Lab from Barcelona Supercomputing Center.
360
+
361
+ ### Contact
362
+ For further information, please send an email to <langtech@bsc.es>.
363
+
364
+ ### Copyright
365
+ Copyright(c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.
366
+
367
+
368
+ ### Funding
369
+
370
+ This work has been promoted and financed by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.
371
+
372
+ ### Acknowledgements
373
+
374
+ This project has benefited from data contributions by numerous teams and institutions.
375
+
376
+ In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.
377
+
378
+ At the national level, we are especially grateful to our ILENIA project partners, CENID, HiTZ and CiTIUS, for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.
379
+
380
+ At the international level, we thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration.
381
+
382
+ Their valuable efforts have been instrumental in the development of this work.
383
+
384
+ ### Disclaimer
385
+ Be aware that the model may contain biases or other unintended distortions.
386
+ When third parties deploy systems or provide services based on this model, or use the model themselves,
387
+ they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations,
388
+ including those governing the use of Artificial Intelligence.
389
+
390
+ The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.
391
+
392
+ ### License
393
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
config.json ADDED
@@ -0,0 +1,43 @@
1
+ {
2
+ "architectures": [
3
+ "ModernBertForMaskedLM"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 1,
8
+ "classifier_activation": "silu",
9
+ "classifier_bias": false,
10
+ "classifier_dropout": 0.0,
11
+ "classifier_pooling": "mean",
12
+ "cls_token_id": 1,
13
+ "decoder_bias": true,
14
+ "deterministic_flash_attn": false,
15
+ "embedding_dropout": 0.0,
16
+ "eos_token_id": 2,
17
+ "global_attn_every_n_layers": 3,
18
+ "global_rope_theta": 160000.0,
19
+ "gradient_checkpointing": false,
20
+ "hidden_activation": "gelu",
21
+ "hidden_size": 768,
22
+ "initializer_cutoff_factor": 2.0,
23
+ "initializer_range": 0.02,
24
+ "intermediate_size": 1152,
25
+ "layer_norm_eps": 1e-05,
26
+ "local_attention": 128,
27
+ "local_rope_theta": 10000.0,
28
+ "max_position_embeddings": 8192,
29
+ "mlp_bias": false,
30
+ "mlp_dropout": 0.0,
31
+ "model_type": "modernbert",
32
+ "norm_bias": false,
33
+ "norm_eps": 1e-05,
34
+ "num_attention_heads": 12,
35
+ "num_hidden_layers": 22,
36
+ "pad_token_id": 3,
37
+ "position_embedding_type": "absolute",
38
+ "sep_token_id": 2,
39
+ "tie_word_embeddings": true,
40
+ "torch_dtype": "bfloat16",
41
+ "transformers_version": "4.48.0",
42
+ "vocab_size": 256128
43
+ }
git ADDED
File without changes
images/lang_distribution.png ADDED

Git LFS Details

  • SHA256: 209d25d377cb57a1ee4eb074f10a296745f178183f73885a5fc2d8b0eb3774b1
  • Pointer size: 131 Bytes
  • Size of remote file: 480 kB
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b0524504431e1892f3313cd71a5266fe7a043cb0d66c9118c2538016920d42b1
3
+ size 1231552912
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:185a72a82401f708673bdb23c15629519ad329c5946abc0f1ac45966ff1b7cf3
3
+ size 1231581870
special_tokens_map.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<mask>"
4
+ ],
5
+ "bos_token": {
6
+ "content": "<s>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false
11
+ },
12
+ "eos_token": {
13
+ "content": "</s>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false
18
+ },
19
+ "mask_token": {
20
+ "content": "<mask>",
21
+ "lstrip": true,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false
25
+ },
26
+ "pad_token": {
27
+ "content": "<pad>",
28
+ "lstrip": false,
29
+ "normalized": false,
30
+ "rstrip": false,
31
+ "single_word": false
32
+ },
33
+ "unk_token": {
34
+ "content": "<unk>",
35
+ "lstrip": false,
36
+ "normalized": false,
37
+ "rstrip": false,
38
+ "single_word": false
39
+ }
40
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5de759bbb22a74d2cdd9ea819fa46faa89612fff8d2fcce93a926961d2a233a9
3
+ size 19092520
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8ddbda5816a0138ffd754cbbfafceba9628342cdd91df4bea6ee86f0fb44eae9
3
+ size 4813260
tokenizer_config.json ADDED
@@ -0,0 +1,1107 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": true,
4
+ "add_prefix_space": true,
5
+ "added_tokens_decoder": {
6
+ "0": {
7
+ "content": "<unk>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false,
12
+ "special": true
13
+ },
14
+ "1": {
15
+ "content": "<s>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false,
20
+ "special": true
21
+ },
22
+ "2": {
23
+ "content": "</s>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false,
28
+ "special": true
29
+ },
30
+ "3": {
31
+ "content": "<pad>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false,
36
+ "special": true
37
+ },
38
+ "4": {
39
+ "content": "<|im_start|>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false,
44
+ "special": true
45
+ },
46
+ "5": {
47
+ "content": "<|im_end|>",
48
+ "lstrip": false,
49
+ "normalized": false,
50
+ "rstrip": false,
51
+ "single_word": false,
52
+ "special": true
53
+ },
54
+ "6": {
55
+ "content": "<mask>",
56
+ "lstrip": true,
57
+ "normalized": false,
58
+ "rstrip": false,
59
+ "single_word": false,
60
+ "special": true
61
+ },
62
+ "7": {
63
+ "content": "<|reserved_token_2|>",
64
+ "lstrip": false,
65
+ "normalized": false,
66
+ "rstrip": false,
67
+ "single_word": false,
68
+ "special": true
69
+ },
70
+ "8": {
71
+ "content": "<|reserved_token_3|>",
72
+ "lstrip": false,
73
+ "normalized": false,
74
+ "rstrip": false,
75
+ "single_word": false,
76
+ "special": true
77
+ },
78
+ "9": {
79
+ "content": "<|reserved_token_4|>",
80
+ "lstrip": false,
81
+ "normalized": false,
82
+ "rstrip": false,
83
+ "single_word": false,
84
+ "special": true
85
+ },
86
+ "10": {
87
+ "content": "<|reserved_token_5|>",
88
+ "lstrip": false,
89
+ "normalized": false,
90
+ "rstrip": false,
91
+ "single_word": false,
92
+ "special": true
93
+ },
94
+ "11": {
95
+ "content": "<|reserved_token_6|>",
96
+ "lstrip": false,
97
+ "normalized": false,
98
+ "rstrip": false,
99
+ "single_word": false,
100
+ "special": true
101
+ },
102
+ "12": {
103
+ "content": "<|reserved_token_7|>",
104
+ "lstrip": false,
105
+ "normalized": false,
106
+ "rstrip": false,
107
+ "single_word": false,
108
+ "special": true
109
+ },
110
+ "13": {
111
+ "content": "<|reserved_token_8|>",
112
+ "lstrip": false,
113
+ "normalized": false,
114
+ "rstrip": false,
115
+ "single_word": false,
116
+ "special": true
117
+ },
118
+ "14": {
119
+ "content": "<|reserved_token_9|>",
120
+ "lstrip": false,
121
+ "normalized": false,
122
+ "rstrip": false,
123
+ "single_word": false,
124
+ "special": true
125
+ },
126
+ "15": {
127
+ "content": "<|reserved_token_10|>",
128
+ "lstrip": false,
129
+ "normalized": false,
130
+ "rstrip": false,
131
+ "single_word": false,
132
+ "special": true
133
+ },
134
+ "16": {
135
+ "content": "<|reserved_token_11|>",
136
+ "lstrip": false,
137
+ "normalized": false,
138
+ "rstrip": false,
139
+ "single_word": false,
140
+ "special": true
141
+ },
142
+ "17": {
143
+ "content": "<|reserved_token_12|>",
144
+ "lstrip": false,
145
+ "normalized": false,
146
+ "rstrip": false,
147
+ "single_word": false,
148
+ "special": true
149
+ },
150
+ "18": {
151
+ "content": "<|reserved_token_13|>",
152
+ "lstrip": false,
153
+ "normalized": false,
154
+ "rstrip": false,
155
+ "single_word": false,
156
+ "special": true
157
+ },
158
+ "19": {
159
+ "content": "<|reserved_token_14|>",
160
+ "lstrip": false,
161
+ "normalized": false,
162
+ "rstrip": false,
163
+ "single_word": false,
164
+ "special": true
165
+ },
166
+ "20": {
167
+ "content": "<|reserved_token_15|>",
168
+ "lstrip": false,
169
+ "normalized": false,
170
+ "rstrip": false,
171
+ "single_word": false,
172
+ "special": true
173
+ },
174
+ "21": {
175
+ "content": "<|reserved_token_16|>",
176
+ "lstrip": false,
177
+ "normalized": false,
178
+ "rstrip": false,
179
+ "single_word": false,
180
+ "special": true
181
+ },
182
+ "22": {
183
+ "content": "<|reserved_token_17|>",
184
+ "lstrip": false,
185
+ "normalized": false,
186
+ "rstrip": false,
187
+ "single_word": false,
188
+ "special": true
189
+ },
190
+ "23": {
191
+ "content": "<|reserved_token_18|>",
192
+ "lstrip": false,
193
+ "normalized": false,
194
+ "rstrip": false,
195
+ "single_word": false,
196
+ "special": true
197
+ },
198
+ "24": {
199
+ "content": "<|reserved_token_19|>",
200
+ "lstrip": false,
201
+ "normalized": false,
202
+ "rstrip": false,
203
+ "single_word": false,
204
+ "special": true
205
+ },
206
+ "25": {
207
+ "content": "<|reserved_token_20|>",
208
+ "lstrip": false,
209
+ "normalized": false,
210
+ "rstrip": false,
211
+ "single_word": false,
212
+ "special": true
213
+ },
214
+ "26": {
215
+ "content": "<|reserved_token_21|>",
216
+ "lstrip": false,
217
+ "normalized": false,
218
+ "rstrip": false,
219
+ "single_word": false,
220
+ "special": true
221
+ },
222
+ "27": {
223
+ "content": "<|reserved_token_22|>",
224
+ "lstrip": false,
225
+ "normalized": false,
226
+ "rstrip": false,
227
+ "single_word": false,
228
+ "special": true
229
+ },
230
+ "28": {
231
+ "content": "<|reserved_token_23|>",
232
+ "lstrip": false,
233
+ "normalized": false,
234
+ "rstrip": false,
235
+ "single_word": false,
236
+ "special": true
237
+ },
238
+ "29": {
239
+ "content": "<|reserved_token_24|>",
240
+ "lstrip": false,
241
+ "normalized": false,
242
+ "rstrip": false,
243
+ "single_word": false,
244
+ "special": true
245
+ },
246
+ "30": {
247
+ "content": "<|reserved_token_25|>",
248
+ "lstrip": false,
249
+ "normalized": false,
250
+ "rstrip": false,
251
+ "single_word": false,
252
+ "special": true
253
+ },
254
+ "31": {
255
+ "content": "<|reserved_token_26|>",
256
+ "lstrip": false,
257
+ "normalized": false,
258
+ "rstrip": false,
259
+ "single_word": false,
260
+ "special": true
261
+ },
262
+ "32": {
263
+ "content": "<|reserved_token_27|>",
264
+ "lstrip": false,
265
+ "normalized": false,
266
+ "rstrip": false,
267
+ "single_word": false,
268
+ "special": true
269
+ },
270
+ "33": {
271
+ "content": "<|reserved_token_28|>",
272
+ "lstrip": false,
273
+ "normalized": false,
274
+ "rstrip": false,
275
+ "single_word": false,
276
+ "special": true
277
+ },
278
+ "34": {
279
+ "content": "<|reserved_token_29|>",
280
+ "lstrip": false,
281
+ "normalized": false,
282
+ "rstrip": false,
283
+ "single_word": false,
284
+ "special": true
285
+ },
286
+ "35": {
287
+ "content": "<|reserved_token_30|>",
288
+ "lstrip": false,
289
+ "normalized": false,
290
+ "rstrip": false,
291
+ "single_word": false,
292
+ "special": true
293
+ },
294
+ "36": {
295
+ "content": "<|reserved_token_31|>",
296
+ "lstrip": false,
297
+ "normalized": false,
298
+ "rstrip": false,
299
+ "single_word": false,
300
+ "special": true
301
+ },
302
+ "37": {
303
+ "content": "<|reserved_token_32|>",
304
+ "lstrip": false,
305
+ "normalized": false,
306
+ "rstrip": false,
307
+ "single_word": false,
308
+ "special": true
309
+ },
310
+ "38": {
311
+ "content": "<|reserved_token_33|>",
312
+ "lstrip": false,
313
+ "normalized": false,
314
+ "rstrip": false,
315
+ "single_word": false,
316
+ "special": true
317
+ },
318
+ "39": {
319
+ "content": "<|reserved_token_34|>",
320
+ "lstrip": false,
321
+ "normalized": false,
322
+ "rstrip": false,
323
+ "single_word": false,
324
+ "special": true
325
+ },
326
+ "40": {
327
+ "content": "<|reserved_token_35|>",
328
+ "lstrip": false,
329
+ "normalized": false,
330
+ "rstrip": false,
331
+ "single_word": false,
332
+ "special": true
333
+ },
334
+ "41": {
335
+ "content": "<|reserved_token_36|>",
336
+ "lstrip": false,
337
+ "normalized": false,
338
+ "rstrip": false,
339
+ "single_word": false,
340
+ "special": true
341
+ },
342
+ "42": {
343
+ "content": "<|reserved_token_37|>",
344
+ "lstrip": false,
345
+ "normalized": false,
346
+ "rstrip": false,
347
+ "single_word": false,
348
+ "special": true
349
+ },
350
+ "43": {
351
+ "content": "<|reserved_token_38|>",
352
+ "lstrip": false,
353
+ "normalized": false,
354
+ "rstrip": false,
355
+ "single_word": false,
356
+ "special": true
357
+ },
358
+ "44": {
359
+ "content": "<|reserved_token_39|>",
360
+ "lstrip": false,
361
+ "normalized": false,
362
+ "rstrip": false,
363
+ "single_word": false,
364
+ "special": true
365
+ },
366
+ "45": {
367
+ "content": "<|reserved_token_40|>",
368
+ "lstrip": false,
369
+ "normalized": false,
370
+ "rstrip": false,
371
+ "single_word": false,
372
+ "special": true
373
+ },
374
+ "46": {
375
+ "content": "<|reserved_token_41|>",
376
+ "lstrip": false,
377
+ "normalized": false,
378
+ "rstrip": false,
379
+ "single_word": false,
380
+ "special": true
381
+ },
382
+ "47": {
383
+ "content": "<|reserved_token_42|>",
384
+ "lstrip": false,
385
+ "normalized": false,
386
+ "rstrip": false,
387
+ "single_word": false,
388
+ "special": true
389
+ },
390
+ "48": {
391
+ "content": "<|reserved_token_43|>",
392
+ "lstrip": false,
393
+ "normalized": false,
394
+ "rstrip": false,
395
+ "single_word": false,
396
+ "special": true
397
+ },
398
+ "49": {
399
+ "content": "<|reserved_token_44|>",
400
+ "lstrip": false,
401
+ "normalized": false,
402
+ "rstrip": false,
403
+ "single_word": false,
404
+ "special": true
405
+ },
406
+ "50": {
407
+ "content": "<|reserved_token_45|>",
408
+ "lstrip": false,
409
+ "normalized": false,
410
+ "rstrip": false,
411
+ "single_word": false,
412
+ "special": true
413
+ },
414
+ "51": {
415
+ "content": "<|reserved_token_46|>",
416
+ "lstrip": false,
417
+ "normalized": false,
418
+ "rstrip": false,
419
+ "single_word": false,
420
+ "special": true
421
+ },
422
+ "52": {
423
+ "content": "<|reserved_token_47|>",
424
+ "lstrip": false,
425
+ "normalized": false,
426
+ "rstrip": false,
427
+ "single_word": false,
428
+ "special": true
429
+ },
430
+ "53": {
431
+ "content": "<|reserved_token_48|>",
432
+ "lstrip": false,
433
+ "normalized": false,
434
+ "rstrip": false,
435
+ "single_word": false,
436
+ "special": true
437
+ },
438
+ "54": {
439
+ "content": "<|reserved_token_49|>",
440
+ "lstrip": false,
441
+ "normalized": false,
442
+ "rstrip": false,
443
+ "single_word": false,
444
+ "special": true
445
+ },
446
+ "55": {
447
+ "content": "<|reserved_token_50|>",
448
+ "lstrip": false,
449
+ "normalized": false,
450
+ "rstrip": false,
451
+ "single_word": false,
452
+ "special": true
453
+ },
454
+ "56": {
455
+ "content": "<|reserved_token_51|>",
456
+ "lstrip": false,
457
+ "normalized": false,
458
+ "rstrip": false,
459
+ "single_word": false,
460
+ "special": true
461
+ },
462
+ "57": {
463
+ "content": "<|reserved_token_52|>",
464
+ "lstrip": false,
465
+ "normalized": false,
466
+ "rstrip": false,
467
+ "single_word": false,
468
+ "special": true
469
+ },
470
+ "58": {
471
+ "content": "<|reserved_token_53|>",
472
+ "lstrip": false,
473
+ "normalized": false,
474
+ "rstrip": false,
475
+ "single_word": false,
476
+ "special": true
477
+ },
478
+ "59": {
479
+ "content": "<|reserved_token_54|>",
480
+ "lstrip": false,
481
+ "normalized": false,
482
+ "rstrip": false,
483
+ "single_word": false,
484
+ "special": true
485
+ },
486
+ "60": {
487
+ "content": "<|reserved_token_55|>",
488
+ "lstrip": false,
489
+ "normalized": false,
490
+ "rstrip": false,
491
+ "single_word": false,
492
+ "special": true
493
+ },
494
+ "61": {
495
+ "content": "<|reserved_token_56|>",
496
+ "lstrip": false,
497
+ "normalized": false,
498
+ "rstrip": false,
499
+ "single_word": false,
500
+ "special": true
501
+ },
502
+ "62": {
503
+ "content": "<|reserved_token_57|>",
504
+ "lstrip": false,
505
+ "normalized": false,
506
+ "rstrip": false,
507
+ "single_word": false,
508
+ "special": true
509
+ },
510
+ "63": {
511
+ "content": "<|reserved_token_58|>",
512
+ "lstrip": false,
513
+ "normalized": false,
514
+ "rstrip": false,
515
+ "single_word": false,
516
+ "special": true
517
+ },
518
+ "64": {
519
+ "content": "<|reserved_token_59|>",
520
+ "lstrip": false,
521
+ "normalized": false,
522
+ "rstrip": false,
523
+ "single_word": false,
524
+ "special": true
525
+ },
526
+ "65": {
527
+ "content": "<|reserved_token_60|>",
528
+ "lstrip": false,
529
+ "normalized": false,
530
+ "rstrip": false,
531
+ "single_word": false,
532
+ "special": true
533
+ },
534
+ "66": {
535
+ "content": "<|reserved_token_61|>",
536
+ "lstrip": false,
537
+ "normalized": false,
538
+ "rstrip": false,
539
+ "single_word": false,
540
+ "special": true
541
+ },
542
+ "67": {
543
+ "content": "<|reserved_token_62|>",
544
+ "lstrip": false,
545
+ "normalized": false,
546
+ "rstrip": false,
547
+ "single_word": false,
548
+ "special": true
549
+ },
550
+ "68": {
551
+ "content": "<|reserved_token_63|>",
552
+ "lstrip": false,
553
+ "normalized": false,
554
+ "rstrip": false,
555
+ "single_word": false,
556
+ "special": true
557
+ },
558
+ "69": {
559
+ "content": "<|reserved_token_64|>",
560
+ "lstrip": false,
561
+ "normalized": false,
562
+ "rstrip": false,
563
+ "single_word": false,
564
+ "special": true
565
+ },
566
+ "70": {
567
+ "content": "<|reserved_token_65|>",
568
+ "lstrip": false,
569
+ "normalized": false,
570
+ "rstrip": false,
571
+ "single_word": false,
572
+ "special": true
573
+ },
574
+ "71": {
575
+ "content": "<|reserved_token_66|>",
576
+ "lstrip": false,
577
+ "normalized": false,
578
+ "rstrip": false,
579
+ "single_word": false,
580
+ "special": true
581
+ },
582
+ "72": {
583
+ "content": "<|reserved_token_67|>",
584
+ "lstrip": false,
585
+ "normalized": false,
586
+ "rstrip": false,
587
+ "single_word": false,
588
+ "special": true
589
+ },
590
+ "73": {
591
+ "content": "<|reserved_token_68|>",
592
+ "lstrip": false,
593
+ "normalized": false,
594
+ "rstrip": false,
595
+ "single_word": false,
596
+ "special": true
597
+ },
598
+ "74": {
599
+ "content": "<|reserved_token_69|>",
600
+ "lstrip": false,
601
+ "normalized": false,
602
+ "rstrip": false,
603
+ "single_word": false,
604
+ "special": true
605
+ },
606
+ "75": {
607
+ "content": "<|reserved_token_70|>",
608
+ "lstrip": false,
609
+ "normalized": false,
610
+ "rstrip": false,
611
+ "single_word": false,
612
+ "special": true
613
+ },
614
+ "76": {
615
+ "content": "<|reserved_token_71|>",
616
+ "lstrip": false,
617
+ "normalized": false,
618
+ "rstrip": false,
619
+ "single_word": false,
620
+ "special": true
621
+ },
622
+ "77": {
623
+ "content": "<|reserved_token_72|>",
624
+ "lstrip": false,
625
+ "normalized": false,
626
+ "rstrip": false,
627
+ "single_word": false,
628
+ "special": true
629
+ },
630
+ "78": {
631
+ "content": "<|reserved_token_73|>",
632
+ "lstrip": false,
633
+ "normalized": false,
634
+ "rstrip": false,
635
+ "single_word": false,
636
+ "special": true
637
+ },
638
+ "79": {
639
+ "content": "<|reserved_token_74|>",
640
+ "lstrip": false,
641
+ "normalized": false,
642
+ "rstrip": false,
643
+ "single_word": false,
644
+ "special": true
645
+ },
646
+ "80": {
647
+ "content": "<|reserved_token_75|>",
648
+ "lstrip": false,
649
+ "normalized": false,
650
+ "rstrip": false,
651
+ "single_word": false,
652
+ "special": true
653
+ },
654
+ "81": {
655
+ "content": "<|reserved_token_76|>",
656
+ "lstrip": false,
657
+ "normalized": false,
658
+ "rstrip": false,
659
+ "single_word": false,
660
+ "special": true
661
+ },
662
+ "82": {
663
+ "content": "<|reserved_token_77|>",
664
+ "lstrip": false,
665
+ "normalized": false,
666
+ "rstrip": false,
667
+ "single_word": false,
668
+ "special": true
669
+ },
670
+ "83": {
671
+ "content": "<|reserved_token_78|>",
672
+ "lstrip": false,
673
+ "normalized": false,
674
+ "rstrip": false,
675
+ "single_word": false,
676
+ "special": true
677
+ },
678
+ "84": {
679
+ "content": "<|reserved_token_79|>",
680
+ "lstrip": false,
681
+ "normalized": false,
682
+ "rstrip": false,
683
+ "single_word": false,
684
+ "special": true
685
+ },
686
+ "85": {
687
+ "content": "<|reserved_token_80|>",
688
+ "lstrip": false,
689
+ "normalized": false,
690
+ "rstrip": false,
691
+ "single_word": false,
692
+ "special": true
693
+ },
694
+ "86": {
695
+ "content": "<|reserved_token_81|>",
696
+ "lstrip": false,
697
+ "normalized": false,
698
+ "rstrip": false,
699
+ "single_word": false,
700
+ "special": true
701
+ },
702
+ "87": {
703
+ "content": "<|reserved_token_82|>",
704
+ "lstrip": false,
705
+ "normalized": false,
706
+ "rstrip": false,
707
+ "single_word": false,
708
+ "special": true
709
+ },
710
+ "88": {
711
+ "content": "<|reserved_token_83|>",
712
+ "lstrip": false,
713
+ "normalized": false,
714
+ "rstrip": false,
715
+ "single_word": false,
716
+ "special": true
717
+ },
718
+ "89": {
719
+ "content": "<|reserved_token_84|>",
720
+ "lstrip": false,
721
+ "normalized": false,
722
+ "rstrip": false,
723
+ "single_word": false,
724
+ "special": true
725
+ },
726
+ "90": {
727
+ "content": "<|reserved_token_85|>",
728
+ "lstrip": false,
729
+ "normalized": false,
730
+ "rstrip": false,
731
+ "single_word": false,
732
+ "special": true
733
+ },
734
+ "91": {
735
+ "content": "<|reserved_token_86|>",
736
+ "lstrip": false,
737
+ "normalized": false,
738
+ "rstrip": false,
739
+ "single_word": false,
740
+ "special": true
741
+ },
742
+ "92": {
743
+ "content": "<|reserved_token_87|>",
744
+ "lstrip": false,
745
+ "normalized": false,
746
+ "rstrip": false,
747
+ "single_word": false,
748
+ "special": true
749
+ },
750
+ "93": {
751
+ "content": "<|reserved_token_88|>",
752
+ "lstrip": false,
753
+ "normalized": false,
754
+ "rstrip": false,
755
+ "single_word": false,
756
+ "special": true
757
+ },
758
+ "94": {
759
+ "content": "<|reserved_token_89|>",
760
+ "lstrip": false,
761
+ "normalized": false,
762
+ "rstrip": false,
763
+ "single_word": false,
764
+ "special": true
765
+ },
766
+ "95": {
767
+ "content": "<|reserved_token_90|>",
768
+ "lstrip": false,
769
+ "normalized": false,
770
+ "rstrip": false,
771
+ "single_word": false,
772
+ "special": true
773
+ },
774
+ "96": {
775
+ "content": "<|reserved_token_91|>",
776
+ "lstrip": false,
777
+ "normalized": false,
778
+ "rstrip": false,
779
+ "single_word": false,
780
+ "special": true
781
+ },
782
+ "97": {
783
+ "content": "<|reserved_token_92|>",
784
+ "lstrip": false,
785
+ "normalized": false,
786
+ "rstrip": false,
787
+ "single_word": false,
788
+ "special": true
789
+ },
790
+ "98": {
791
+ "content": "<|reserved_token_93|>",
792
+ "lstrip": false,
793
+ "normalized": false,
794
+ "rstrip": false,
795
+ "single_word": false,
796
+ "special": true
797
+ },
798
+ "99": {
799
+ "content": "<|reserved_token_94|>",
800
+ "lstrip": false,
801
+ "normalized": false,
802
+ "rstrip": false,
803
+ "single_word": false,
804
+ "special": true
805
+ },
806
+ "100": {
807
+ "content": "<|reserved_token_95|>",
808
+ "lstrip": false,
809
+ "normalized": false,
810
+ "rstrip": false,
811
+ "single_word": false,
812
+ "special": true
813
+ },
814
+ "101": {
815
+ "content": "<|reserved_token_96|>",
816
+ "lstrip": false,
817
+ "normalized": false,
818
+ "rstrip": false,
819
+ "single_word": false,
820
+ "special": true
821
+ },
822
+ "102": {
823
+ "content": "<|reserved_token_97|>",
824
+ "lstrip": false,
825
+ "normalized": false,
826
+ "rstrip": false,
827
+ "single_word": false,
828
+ "special": true
829
+ },
830
+ "103": {
831
+ "content": "<|reserved_token_98|>",
832
+ "lstrip": false,
833
+ "normalized": false,
834
+ "rstrip": false,
835
+ "single_word": false,
836
+ "special": true
837
+ },
838
+ "104": {
839
+ "content": "\\r",
840
+ "lstrip": false,
841
+ "normalized": false,
842
+ "rstrip": false,
843
+ "single_word": false,
844
+ "special": false
845
+ },
846
+ "105": {
847
+ "content": "\u2581\u2581",
848
+ "lstrip": false,
849
+ "normalized": false,
850
+ "rstrip": false,
851
+ "single_word": false,
852
+ "special": false
853
+ },
854
+ "106": {
855
+ "content": "\u2581\u2581\u2581",
856
+ "lstrip": false,
857
+ "normalized": false,
858
+ "rstrip": false,
859
+ "single_word": false,
860
+ "special": false
861
+ },
862
+ "107": {
863
+ "content": "\u2581\u2581\u2581\u2581",
864
+ "lstrip": false,
865
+ "normalized": false,
866
+ "rstrip": false,
867
+ "single_word": false,
868
+ "special": false
869
+ },
870
+ "108": {
871
+ "content": "\u2581\u2581\u2581\u2581\u2581",
872
+ "lstrip": false,
873
+ "normalized": false,
874
+ "rstrip": false,
875
+ "single_word": false,
876
+ "special": false
877
+ },
878
+ "109": {
879
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581",
880
+ "lstrip": false,
881
+ "normalized": false,
882
+ "rstrip": false,
883
+ "single_word": false,
884
+ "special": false
885
+ },
886
+ "110": {
887
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
888
+ "lstrip": false,
889
+ "normalized": false,
890
+ "rstrip": false,
891
+ "single_word": false,
892
+ "special": false
893
+ },
894
+ "111": {
895
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
896
+ "lstrip": false,
897
+ "normalized": false,
898
+ "rstrip": false,
899
+ "single_word": false,
900
+ "special": false
901
+ },
902
+ "112": {
903
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
904
+ "lstrip": false,
905
+ "normalized": false,
906
+ "rstrip": false,
907
+ "single_word": false,
908
+ "special": false
909
+ },
910
+ "113": {
911
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
912
+ "lstrip": false,
913
+ "normalized": false,
914
+ "rstrip": false,
915
+ "single_word": false,
916
+ "special": false
917
+ },
918
+ "114": {
919
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
920
+ "lstrip": false,
921
+ "normalized": false,
922
+ "rstrip": false,
923
+ "single_word": false,
924
+ "special": false
925
+ },
926
+ "115": {
927
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
928
+ "lstrip": false,
929
+ "normalized": false,
930
+ "rstrip": false,
931
+ "single_word": false,
932
+ "special": false
933
+ },
934
+ "116": {
935
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
936
+ "lstrip": false,
937
+ "normalized": false,
938
+ "rstrip": false,
939
+ "single_word": false,
940
+ "special": false
941
+ },
942
+ "117": {
943
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
944
+ "lstrip": false,
945
+ "normalized": false,
946
+ "rstrip": false,
947
+ "single_word": false,
948
+ "special": false
949
+ },
950
+ "118": {
951
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
952
+ "lstrip": false,
953
+ "normalized": false,
954
+ "rstrip": false,
955
+ "single_word": false,
956
+ "special": false
957
+ },
958
+ "119": {
959
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
960
+ "lstrip": false,
961
+ "normalized": false,
962
+ "rstrip": false,
963
+ "single_word": false,
964
+ "special": false
965
+ },
966
+ "120": {
967
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
968
+ "lstrip": false,
969
+ "normalized": false,
970
+ "rstrip": false,
971
+ "single_word": false,
972
+ "special": false
973
+ },
974
+ "121": {
975
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
976
+ "lstrip": false,
977
+ "normalized": false,
978
+ "rstrip": false,
979
+ "single_word": false,
980
+ "special": false
981
+ },
982
+ "122": {
983
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
984
+ "lstrip": false,
985
+ "normalized": false,
986
+ "rstrip": false,
987
+ "single_word": false,
988
+ "special": false
989
+ },
990
+ "123": {
991
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
992
+ "lstrip": false,
993
+ "normalized": false,
994
+ "rstrip": false,
995
+ "single_word": false,
996
+ "special": false
997
+ },
998
+ "124": {
999
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
1000
+ "lstrip": false,
1001
+ "normalized": false,
1002
+ "rstrip": false,
1003
+ "single_word": false,
1004
+ "special": false
1005
+ },
1006
+ "125": {
1007
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
1008
+ "lstrip": false,
1009
+ "normalized": false,
1010
+ "rstrip": false,
1011
+ "single_word": false,
1012
+ "special": false
1013
+ },
1014
+ "126": {
1015
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
1016
+ "lstrip": false,
1017
+ "normalized": false,
1018
+ "rstrip": false,
1019
+ "single_word": false,
1020
+ "special": false
1021
+ },
1022
+ "127": {
1023
+ "content": "\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581\u2581",
1024
+ "lstrip": false,
1025
+ "normalized": false,
1026
+ "rstrip": false,
1027
+ "single_word": false,
1028
+ "special": false
1029
+ },
1030
+ "128": {
1031
+ "content": "\t\t",
1032
+ "lstrip": false,
1033
+ "normalized": false,
1034
+ "rstrip": false,
1035
+ "single_word": false,
1036
+ "special": false
1037
+ },
1038
+ "129": {
1039
+ "content": "\t\t\t",
1040
+ "lstrip": false,
1041
+ "normalized": false,
1042
+ "rstrip": false,
1043
+ "single_word": false,
1044
+ "special": false
1045
+ },
1046
+ "130": {
1047
+ "content": "\t\t\t\t",
1048
+ "lstrip": false,
1049
+ "normalized": false,
1050
+ "rstrip": false,
1051
+ "single_word": false,
1052
+ "special": false
1053
+ },
1054
+ "131": {
1055
+ "content": "\t\t\t\t\t",
1056
+ "lstrip": false,
1057
+ "normalized": false,
1058
+ "rstrip": false,
1059
+ "single_word": false,
1060
+ "special": false
1061
+ },
1062
+ "132": {
1063
+ "content": "\t\t\t\t\t\t",
1064
+ "lstrip": false,
1065
+ "normalized": false,
1066
+ "rstrip": false,
1067
+ "single_word": false,
1068
+ "special": false
1069
+ },
1070
+ "133": {
1071
+ "content": "\n\n",
1072
+ "lstrip": false,
1073
+ "normalized": false,
1074
+ "rstrip": false,
1075
+ "single_word": false,
1076
+ "special": false
1077
+ },
1078
+ "134": {
1079
+ "content": "\n\n\n",
1080
+ "lstrip": false,
1081
+ "normalized": false,
1082
+ "rstrip": false,
1083
+ "single_word": false,
1084
+ "special": false
1085
+ }
1086
+ },
1087
+ "additional_special_tokens": [
1088
+ "<mask>"
1089
+ ],
1090
+ "bos_token": "<s>",
1091
+ "clean_up_tokenization_spaces": false,
1092
+ "eos_token": "</s>",
1093
+ "legacy": true,
1094
+ "mask_token": "<mask>",
1095
+ "model_max_length": 8192,
1096
+ "pad_token": "<pad>",
1097
+ "padding_side": "right",
1098
+ "sp_model_kwargs": {},
1099
+ "spaces_between_special_tokens": false,
1100
+ "tokenizer_class": "LlamaTokenizer",
1101
+ "unk_token": "<unk>",
1102
+ "use_default_system_prompt": false,
1103
+ "model_input_names": [
1104
+ "input_ids",
1105
+ "attention_mask"
1106
+ ]
1107
+ }
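Each numbered entry in the `added_tokens_decoder` section above describes one added token: `special: true` marks control tokens (such as the `<|reserved_token_*|>` placeholders) that are dropped when decoding with `skip_special_tokens=True`, while `special: false` entries (the carriage return, `▁` whitespace runs, tab runs, newline runs) are ordinary vocabulary items useful for code. As a minimal sketch of how these fields are laid out, the snippet below parses a small excerpt of this config with the standard `json` module rather than loading the full file:

```python
import json

# Hypothetical excerpt of the tokenizer_config.json shown above,
# embedded here so the example is self-contained.
excerpt = """
{
  "added_tokens_decoder": {
    "24": {"content": "<|reserved_token_19|>", "lstrip": false,
           "normalized": false, "rstrip": false,
           "single_word": false, "special": true},
    "104": {"content": "\\r", "lstrip": false, "normalized": false,
            "rstrip": false, "single_word": false, "special": false}
  },
  "mask_token": "<mask>",
  "model_max_length": 8192,
  "tokenizer_class": "LlamaTokenizer"
}
"""

config = json.loads(excerpt)

# Entry "104" stores a literal carriage return via the JSON escape "\r".
cr_token = config["added_tokens_decoder"]["104"]
print(cr_token["content"] == "\r")        # True
print(cr_token["special"])                # False: a regular vocab item

# Special entries are control tokens, not surface text.
print(config["added_tokens_decoder"]["24"]["special"])  # True
print(config["model_max_length"])          # 8192-token context window
```

In practice these fields are consumed automatically when the tokenizer is loaded with `AutoTokenizer.from_pretrained(...)`; the excerpt above is only for illustrating the schema.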