Za6na committed · Commit fd76130 · verified · 1 Parent(s): 3e19691

Add KuBERT model to bert subfolder

bert/.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
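Review note: the patterns above decide which files in this commit are stored through Git LFS. The sketch below approximates that matching with Python's `fnmatch` on file basenames. This is an assumption for illustration only: Git's own pattern rules differ in places, the pattern list is a subset of the one above, and the directory pattern `saved_model/**/*` is left out. The file names are taken from the diff headers of this commit.

```python
from fnmatch import fnmatch

# A subset of the basename patterns from bert/.gitattributes.
# fnmatch only approximates Git's matching rules (assumption of this sketch).
LFS_PATTERNS = ["*.bin", "*.h5", "*.msgpack", "*.onnx", "*.pt", "*.pth",
                "*.safetensors", "*tfevents*"]

# Files added by this commit, as named in the diff headers.
COMMIT_FILES = [
    "bert/.gitattributes",
    "bert/README.md",
    "bert/added_tokens.json",
    "bert/config.json",
    "bert/model.safetensors",
    "bert/special_tokens_map.json",
    "bert/tokenizer_config.json",
]


def is_lfs_tracked(path: str) -> bool:
    """Return True if the file's basename matches any LFS pattern."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(name, pattern) for pattern in LFS_PATTERNS)


lfs_files = [path for path in COMMIT_FILES if is_lfs_tracked(path)]
print(lfs_files)
```

Under this approximation only `bert/model.safetensors` is matched, which agrees with that file appearing in the diff as a three-line LFS pointer rather than as binary content.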
bert/README.md ADDED
@@ -0,0 +1,103 @@
+ ---
+ pipeline_tag: zero-shot-classification
+ ---
+ # KuBERT: Central Kurdish BERT Model
+
+ ## Introduction
+ KuBERT-Central-Kurdish-BERT-Model applies the BERT architecture to Central Kurdish. The project responds to the scarcity of resources and computational models for Kurdish, a language with substantial linguistic diversity.
+
+ ## Data Acquisition for Model Training
+ Data collection is a significant hurdle in training deep learning models, especially for low-resource languages like Kurdish. Complex models such as BERT need large amounts of text to train well, and the scarcity of digital resources makes Kurdish data harder to accumulate than data for many other languages. To build a comprehensive text dataset for Kurdish, material was compiled from several sources.
+
+ ### Corpus Compilation
+ Three main corpora were used to train the Kurdish BERT model, totaling 296.5 million tokens:
+
+ - **AsoSoft corpus**: 188 million tokens from websites, textbooks, and magazines.
+ - **AramRafeq and Muhammad Azizi corpus**: 60 million tokens gathered from Kurdish websites.
+ - **Oscar 2019 corpus**: 48.5 million tokens.
+
+ Together, these corpora give the KuBERT model broad coverage of written Central Kurdish.
+
+ ## Overview
+ The project applies BERT to Kurdish language data. Model training incorporates a Kurdish-specific tokenizer and various classifiers, demonstrating BERT's adaptability to the language's linguistic intricacies. The model and tokenizer can be loaded as follows:
+
+ `from transformers import BertTokenizer, BertModel`
+
+ `tokenizer = BertTokenizer.from_pretrained('asosoft/KuBERT-Central-Kurdish-BERT-Model')`
+ `model = BertModel.from_pretrained('asosoft/KuBERT-Central-Kurdish-BERT-Model')`
+
+ ## Contributions
+ Applying BERT to Kurdish is a significant step forward for the language's computational linguistics and provides a much-needed benchmark for future NLP efforts in under-represented languages. By training on a large corpus of Kurdish text, this project addresses critical gaps in language processing tools for Kurdish.
+
+ ## Training Details
+ The BERT model is fine-tuned on the curated Kurdish dataset, then trained and evaluated to prepare it for a variety of linguistic tasks.
+
+ ## Final Remarks
+ This README summarizes the KuBERT-Central-Kurdish-BERT-Model project, its data acquisition efforts, and its use of BERT for the Kurdish language. For the model's capabilities and complete training details, consult the full documentation and accompanying study materials.
+
+ ### Relevant Links and References
+ - Oscar 2019 corpus: [https://oscar-corpus.com/post/oscar-2019/](https://oscar-corpus.com/post/oscar-2019/)
+ - AsoSoft Kurdish Text Corpus: [https://github.com/AsoSoft/AsoSoft-Text-Corpus](https://github.com/AsoSoft/AsoSoft-Text-Corpus)
+ - Kurdish Resources by Muhammad Azizi and AramRafeq: [https://github.com/DevelopersTree/KurdishResources/](https://github.com/DevelopersTree/KurdishResources/)
+
+ ---
+
+ - Epochs: 3
+ - Max Token Length: 256
+ - Learning Rate: 1e-5
+ - Dropout Rate: 0.3
+ - Batch Size: 8
+ - GPU Utilization: Yes
+
+ ---
+
+ The corpus data tables and the detailed methodology can be found in the full research paper and are summarized here for quick reference:
+
+ ### Corpus Data Tables Summary
+
+ **Table 1: AsoSoft Kurdish Text Corpus**
+ | Source                | Number of Tokens |
+ |-----------------------|------------------|
+ | Crawled From Websites | 95M              |
+ | Text Books            | 45M              |
+ | Magazines             | 48M              |
+ | **Sum**               | **188M**         |
+
+ **Table 2: Muhammad Azizi and AramRafeq Text Corpus**
+ | Source               | Number of Tokens |
+ |----------------------|------------------|
+ | Wikipedia            | 13.5M            |
+ | Wishe Website        | 11M              |
+ | Speemedia Website    | 6.5M             |
+ | Kurdiu Website       | 19M              |
+ | Dengiamerika Website | 2M               |
+ | Chawg Website        | 8M               |
+ | **Sum**              | **60M**          |
+
+ **Table 3: The Kurdish Text Corpus Used to Train BERT**
+ | Corpus Name                         | Number of Tokens |
+ |-------------------------------------|------------------|
+ | Oscar 2019 corpus                   | 48.5M            |
+ | AsoSoft corpus                      | 188M             |
+ | Muhammad Azizi and AramRafeq corpus | 60M              |
+ | **Sum**                             | **296.5M**       |
+
+ ## Cite
+ If you use our text corpus, please cite us.
+
+ Awlla, K.M., Veisi, H. & Abdullah, A.A. Sentiment analysis in low-resource contexts: BERT’s impact on Central Kurdish. Lang Resources & Evaluation (2025). https://doi.org/10.1007/s10579-024-09805-0
+
+ ~~~
+ @article{awlla2025sentiment,
+ title={Sentiment analysis in low-resource contexts: BERT’s impact on Central Kurdish},
+ author={Awlla, K.M. and Veisi, H. and Abdullah, A.A.},
+ journal={Language Resources and Evaluation},
+ volume={35},
+ number={1},
+ pages={123--145}, % Replace with actual page numbers
+ year={2025},
+ publisher={Springer},
+ doi={10.1007/s10579-024-09805-0}
+ }
+
+ ~~~
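Review note: the three corpus tables in the README can be cross-checked with a few lines of arithmetic. The sketch below copies the per-source token counts (in millions) from Tables 1 to 3 and recomputes the sums. The variable names are illustrative and not part of the project.

```python
# Token counts in millions, copied from Tables 1-3 of bert/README.md.
asosoft = {"websites": 95, "text_books": 45, "magazines": 48}
azizi_aramrafeq = {
    "wikipedia": 13.5,
    "wishe": 11,
    "speemedia": 6.5,
    "kurdiu": 19,
    "dengiamerika": 2,
    "chawg": 8,
}
oscar_2019 = 48.5

asosoft_total = sum(asosoft.values())
azizi_total = sum(azizi_aramrafeq.values())
grand_total = oscar_2019 + asosoft_total + azizi_total

print(asosoft_total, azizi_total, grand_total)
```

The recomputed sums agree with the table footers (188M, 60M, and 296.5M), so the tables are internally consistent.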
bert/added_tokens.json ADDED
@@ -0,0 +1,260 @@
1
+ {
2
+ "<|af|>": 50327,
3
+ "<|am|>": 50334,
4
+ "<|as|>": 50350,
5
+ "<|az|>": 50304,
6
+ "<|ba|>": 50355,
7
+ "<|be|>": 50330,
8
+ "<|bg|>": 50292,
9
+ "<|bn|>": 50302,
10
+ "<|bo|>": 50347,
11
+ "<|br|>": 50309,
12
+ "<|bs|>": 50315,
13
+ "<|cy|>": 50297,
14
+ "<|endoftext|>": 50106,
15
+ "<|et|>": 50307,
16
+ "<|eu|>": 50310,
17
+ "<|fa|>": 50300,
18
+ "<|fo|>": 50338,
19
+ "<|gl|>": 50319,
20
+ "<|gu|>": 50333,
21
+ "<|haw|>": 50352,
22
+ "<|ha|>": 50354,
23
+ "<|hr|>": 50291,
24
+ "<|ht|>": 50339,
25
+ "<|hu|>": 50286,
26
+ "<|hy|>": 50312,
27
+ "<|is|>": 50311,
28
+ "<|jw|>": 50356,
29
+ "<|ka|>": 50329,
30
+ "<|kk|>": 50316,
31
+ "<|km|>": 50323,
32
+ "<|kn|>": 50306,
33
+ "<|la|>": 50294,
34
+ "<|lb|>": 50345,
35
+ "<|ln|>": 50353,
36
+ "<|lo|>": 50336,
37
+ "<|lt|>": 50293,
38
+ "<|lv|>": 50301,
39
+ "<|mg|>": 50349,
40
+ "<|mi|>": 50295,
41
+ "<|mk|>": 50308,
42
+ "<|ml|>": 50296,
43
+ "<|mn|>": 50314,
44
+ "<|mr|>": 50320,
45
+ "<|mt|>": 50343,
46
+ "<|my|>": 50346,
47
+ "<|ne|>": 50313,
48
+ "<|nn|>": 50342,
49
+ "<|nocaptions|>": 50362,
50
+ "<|notimestamps|>": 50363,
51
+ "<|no|>": 50288,
52
+ "<|oc|>": 50328,
53
+ "<|pa|>": 50321,
54
+ "<|ps|>": 50340,
55
+ "<|sa|>": 50344,
56
+ "<|sd|>": 50332,
57
+ "<|si|>": 50322,
58
+ "<|sk|>": 50298,
59
+ "<|sl|>": 50305,
60
+ "<|sn|>": 50324,
61
+ "<|so|>": 50326,
62
+ "<|sq|>": 50317,
63
+ "<|sr|>": 50303,
64
+ "<|startoflm|>": 50360,
65
+ "<|startofprev|>": 50361,
66
+ "<|su|>": 50357,
67
+ "<|sw|>": 50318,
68
+ "<|ta|>": 50287,
69
+ "<|te|>": 50299,
70
+ "<|tg|>": 50331,
71
+ "<|th|>": 50289,
72
+ "<|tk|>": 50341,
73
+ "<|tl|>": 50348,
74
+ "<|transcribe|>": 50359,
75
+ "<|translate|>": 50358,
76
+ "<|tt|>": 50351,
77
+ "<|ur|>": 50290,
78
+ "<|uz|>": 50337,
79
+ "<|yi|>": 50335,
80
+ "<|yo|>": 50325,
81
+ "<|¡|>": 50107,
82
+ "<|¢|>": 50108,
83
+ "<|£|>": 50109,
84
+ "<|¤|>": 50110,
85
+ "<|¥|>": 50111,
86
+ "<|¦|>": 50112,
87
+ "<|§|>": 50113,
88
+ "<|¨|>": 50114,
89
+ "<|©|>": 50115,
90
+ "<|ª|>": 50116,
91
+ "<|«|>": 50117,
92
+ "<|¬|>": 50118,
93
+ "<|®|>": 50119,
94
+ "<|¯|>": 50120,
95
+ "<|°|>": 50121,
96
+ "<|±|>": 50122,
97
+ "<|²|>": 50123,
98
+ "<|³|>": 50124,
99
+ "<|´|>": 50125,
100
+ "<|µ|>": 50126,
101
+ "<|¶|>": 50127,
102
+ "<|·|>": 50128,
103
+ "<|¸|>": 50129,
104
+ "<|¹|>": 50130,
105
+ "<|º|>": 50131,
106
+ "<|»|>": 50132,
107
+ "<|¼|>": 50133,
108
+ "<|½|>": 50134,
109
+ "<|¾|>": 50135,
110
+ "<|¿|>": 50136,
111
+ "<|À|>": 50137,
112
+ "<|Á|>": 50138,
113
+ "<|Â|>": 50139,
114
+ "<|Ã|>": 50140,
115
+ "<|Ä|>": 50141,
116
+ "<|Å|>": 50142,
117
+ "<|Æ|>": 50143,
118
+ "<|Ç|>": 50144,
119
+ "<|È|>": 50145,
120
+ "<|É|>": 50146,
121
+ "<|Ê|>": 50147,
122
+ "<|Ë|>": 50148,
123
+ "<|Ì|>": 50149,
124
+ "<|Í|>": 50150,
125
+ "<|Î|>": 50151,
126
+ "<|Ï|>": 50152,
127
+ "<|Ð|>": 50153,
128
+ "<|Ñ|>": 50154,
129
+ "<|Ò|>": 50155,
130
+ "<|Ó|>": 50156,
131
+ "<|Ô|>": 50157,
132
+ "<|Õ|>": 50158,
133
+ "<|Ö|>": 50159,
134
+ "<|×|>": 50160,
135
+ "<|Ø|>": 50161,
136
+ "<|ا|>": 50271,
137
+ "<|اÙĨ|>": 50279,
138
+ "<|ت|>": 50278,
139
+ "<|د|>": 50281,
140
+ "<|ر|>": 50275,
141
+ "<|س|>": 50285,
142
+ "<|Ù|>": 50162,
143
+ "<|Ùħ|>": 50282,
144
+ "<|ÙĨ|>": 50274,
145
+ "<|ÙĪ|>": 50273,
146
+ "<|Ú|>": 50163,
147
+ "<|Ú©|>": 50276,
148
+ "<|Û|>": 50164,
149
+ "<|ÛĨ|>": 50284,
150
+ "<|ÛĮ|>": 50270,
151
+ "<|Ûİ|>": 50280,
152
+ "<|Ûķ|>": 50269,
153
+ "<|Ü|>": 50165,
154
+ "<|Ý|>": 50166,
155
+ "<|Þ|>": 50167,
156
+ "<|ß|>": 50168,
157
+ "<|à|>": 50169,
158
+ "<|á|>": 50170,
159
+ "<|â|>": 50171,
160
+ "<|ã|>": 50172,
161
+ "<|ä|>": 50173,
162
+ "<|å|>": 50174,
163
+ "<|æ|>": 50175,
164
+ "<|ç|>": 50176,
165
+ "<|è|>": 50177,
166
+ "<|é|>": 50178,
167
+ "<|ê|>": 50179,
168
+ "<|ë|>": 50180,
169
+ "<|ì|>": 50181,
170
+ "<|í|>": 50182,
171
+ "<|î|>": 50183,
172
+ "<|ï|>": 50184,
173
+ "<|ð|>": 50185,
174
+ "<|ñ|>": 50186,
175
+ "<|ò|>": 50187,
176
+ "<|ó|>": 50188,
177
+ "<|ô|>": 50189,
178
+ "<|õ|>": 50190,
179
+ "<|ö|>": 50191,
180
+ "<|÷|>": 50192,
181
+ "<|ø|>": 50193,
182
+ "<|ù|>": 50194,
183
+ "<|ú|>": 50195,
184
+ "<|û|>": 50196,
185
+ "<|ü|>": 50197,
186
+ "<|ý|>": 50198,
187
+ "<|þ|>": 50199,
188
+ "<|ÿ|>": 50200,
189
+ "<|Ā|>": 50201,
190
+ "<|ā|>": 50202,
191
+ "<|Ă|>": 50203,
192
+ "<|ă|>": 50204,
193
+ "<|Ą|>": 50205,
194
+ "<|ą|>": 50206,
195
+ "<|Ć|>": 50207,
196
+ "<|ć|>": 50208,
197
+ "<|Ĉ|>": 50209,
198
+ "<|ĉ|>": 50210,
199
+ "<|Ċ|>": 50211,
200
+ "<|ċ|>": 50212,
201
+ "<|Č|>": 50213,
202
+ "<|č|>": 50214,
203
+ "<|Ď|>": 50215,
204
+ "<|ď|>": 50216,
205
+ "<|Đ|>": 50217,
206
+ "<|đ|>": 50218,
207
+ "<|Ē|>": 50219,
208
+ "<|ē|>": 50220,
209
+ "<|Ĕ|>": 50221,
210
+ "<|ĕ|>": 50222,
211
+ "<|Ė|>": 50223,
212
+ "<|ė|>": 50224,
213
+ "<|Ę|>": 50225,
214
+ "<|ę|>": 50226,
215
+ "<|Ě|>": 50227,
216
+ "<|ě|>": 50228,
217
+ "<|Ĝ|>": 50229,
218
+ "<|ĝ|>": 50230,
219
+ "<|Ğ|>": 50231,
220
+ "<|ğ|>": 50232,
221
+ "<|Ġ|>": 50233,
222
+ "<|ĠØ|>": 50272,
223
+ "<|Ġب|>": 50283,
224
+ "<|ĠÙ|>": 50277,
225
+ "<|ġ|>": 50234,
226
+ "<|Ģ|>": 50235,
227
+ "<|ģ|>": 50236,
228
+ "<|Ĥ|>": 50237,
229
+ "<|ĥ|>": 50238,
230
+ "<|Ħ|>": 50239,
231
+ "<|ħ|>": 50240,
232
+ "<|Ĩ|>": 50241,
233
+ "<|ĩ|>": 50242,
234
+ "<|Ī|>": 50243,
235
+ "<|ī|>": 50244,
236
+ "<|Ĭ|>": 50245,
237
+ "<|ĭ|>": 50246,
238
+ "<|Į|>": 50247,
239
+ "<|į|>": 50248,
240
+ "<|İ|>": 50249,
241
+ "<|ı|>": 50250,
242
+ "<|IJ|>": 50251,
243
+ "<|ij|>": 50252,
244
+ "<|Ĵ|>": 50253,
245
+ "<|ĵ|>": 50254,
246
+ "<|Ķ|>": 50255,
247
+ "<|ķ|>": 50256,
248
+ "<|ĸ|>": 50257,
249
+ "<|Ĺ|>": 50258,
250
+ "<|ĺ|>": 50259,
251
+ "<|Ļ|>": 50260,
252
+ "<|ļ|>": 50261,
253
+ "<|Ľ|>": 50262,
254
+ "<|ľ|>": 50263,
255
+ "<|Ŀ|>": 50264,
256
+ "<|ŀ|>": 50265,
257
+ "<|Ł|>": 50266,
258
+ "<|ł|>": 50267,
259
+ "<|Ń|>": 50268
260
+ }
bert/config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "/content/drive/MyDrive/checkpointas-2000000",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 6,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.35.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 50000
+ }
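Review note: every id listed in `bert/added_tokens.json` (50106 to 50363) is at or above the `vocab_size` of 50000 declared in this config. Whether that matters in practice depends on whether the tokenizer ever emits those tokens, which this diff does not show. The sketch below only illustrates the range check, using a few entries copied from `added_tokens.json`; the helper name is hypothetical.

```python
VOCAB_SIZE = 50000  # "vocab_size" from bert/config.json

# A few entries copied from bert/added_tokens.json (not the full file).
ADDED_TOKENS = {
    "<|endoftext|>": 50106,
    "<|hu|>": 50286,
    "<|translate|>": 50358,
    "<|notimestamps|>": 50363,
}


def out_of_range(added_tokens: dict, vocab_size: int) -> dict:
    """Return the tokens whose id would not index a table of vocab_size rows."""
    return {tok: idx for tok, idx in added_tokens.items() if idx >= vocab_size}


overflow = out_of_range(ADDED_TOKENS, VOCAB_SIZE)
# Smallest embedding table that could hold the largest sampled id.
min_rows_needed = max(ADDED_TOKENS.values()) + 1

print(sorted(overflow), min_rows_needed)
```

If these tokens were fed to a model whose embedding table has exactly 50000 rows, the lookup would fall outside the table, so it may be worth confirming that the tokenizer files and the model checkpoint belong together.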
bert/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3762a42c5f3b268866e7b6cfb60314fb77bf48c0e933a45f12c67337a4b165b1
+ size 327667976
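Review note: the LFS pointer size can be compared against the parameter count implied by `bert/config.json`. The sketch below assumes the standard BERT encoder layout (word, position, and token-type embeddings with a LayerNorm, then per-layer attention and feed-forward blocks, then a pooler) and that each weight is stored once as float32. Both are assumptions; the diff does not list the tensors in the file.

```python
def bert_param_count(vocab_size: int, hidden: int, layers: int,
                     intermediate: int, max_pos: int, type_vocab: int) -> int:
    """Count parameters of a standard BERT encoder with a pooler (assumed layout)."""
    layer_norm = 2 * hidden  # weight + bias
    embeddings = (vocab_size + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


# Values copied from bert/config.json.
params = bert_param_count(vocab_size=50000, hidden=768, layers=6,
                          intermediate=3072, max_pos=512, type_vocab=2)
weight_bytes = params * 4            # float32 = 4 bytes per parameter
lfs_size = 327667976                 # "size" from the LFS pointer above
overhead = lfs_size - weight_bytes   # bytes not explained by raw weights

print(params, weight_bytes, overhead)
```

Under these assumptions the count is 81,914,112 parameters, or 327,656,448 bytes of weights, leaving 11,528 bytes that would be consistent with a safetensors header. The close match suggests the checkpoint holds a 6-layer model with a 50000-row embedding table, as the config states.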
bert/special_tokens_map.json ADDED
@@ -0,0 +1,216 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|endoftext|>",
4
+ "<|startoftranscript|>",
5
+ "<|¡|>",
6
+ "<|¢|>",
7
+ "<|£|>",
8
+ "<|¤|>",
9
+ "<|¥|>",
10
+ "<|¦|>",
11
+ "<|§|>",
12
+ "<|¨|>",
13
+ "<|©|>",
14
+ "<|ª|>",
15
+ "<|«|>",
16
+ "<|¬|>",
17
+ "<|®|>",
18
+ "<|¯|>",
19
+ "<|°|>",
20
+ "<|±|>",
21
+ "<|²|>",
22
+ "<|³|>",
23
+ "<|´|>",
24
+ "<|µ|>",
25
+ "<|¶|>",
26
+ "<|·|>",
27
+ "<|¸|>",
28
+ "<|¹|>",
29
+ "<|º|>",
30
+ "<|»|>",
31
+ "<|¼|>",
32
+ "<|½|>",
33
+ "<|¾|>",
34
+ "<|¿|>",
35
+ "<|À|>",
36
+ "<|Á|>",
37
+ "<|Â|>",
38
+ "<|Ã|>",
39
+ "<|Ä|>",
40
+ "<|Å|>",
41
+ "<|Æ|>",
42
+ "<|Ç|>",
43
+ "<|È|>",
44
+ "<|É|>",
45
+ "<|Ê|>",
46
+ "<|Ë|>",
47
+ "<|Ì|>",
48
+ "<|Í|>",
49
+ "<|Î|>",
50
+ "<|Ï|>",
51
+ "<|Ð|>",
52
+ "<|Ñ|>",
53
+ "<|Ò|>",
54
+ "<|Ó|>",
55
+ "<|Ô|>",
56
+ "<|Õ|>",
57
+ "<|Ö|>",
58
+ "<|×|>",
59
+ "<|Ø|>",
60
+ "<|Ù|>",
61
+ "<|Ú|>",
62
+ "<|Û|>",
63
+ "<|Ü|>",
64
+ "<|Ý|>",
65
+ "<|Þ|>",
66
+ "<|ß|>",
67
+ "<|à|>",
68
+ "<|á|>",
69
+ "<|â|>",
70
+ "<|ã|>",
71
+ "<|ä|>",
72
+ "<|å|>",
73
+ "<|æ|>",
74
+ "<|ç|>",
75
+ "<|è|>",
76
+ "<|é|>",
77
+ "<|ê|>",
78
+ "<|ë|>",
79
+ "<|ì|>",
80
+ "<|í|>",
81
+ "<|î|>",
82
+ "<|ï|>",
83
+ "<|ð|>",
84
+ "<|ñ|>",
85
+ "<|ò|>",
86
+ "<|ó|>",
87
+ "<|ô|>",
88
+ "<|õ|>",
89
+ "<|ö|>",
90
+ "<|÷|>",
91
+ "<|ø|>",
92
+ "<|ù|>",
93
+ "<|ú|>",
94
+ "<|û|>",
95
+ "<|ü|>",
96
+ "<|ý|>",
97
+ "<|þ|>",
98
+ "<|ÿ|>",
99
+ "<|Ā|>",
100
+ "<|ā|>",
101
+ "<|Ă|>",
102
+ "<|ă|>",
103
+ "<|Ą|>",
104
+ "<|ą|>",
105
+ "<|Ć|>",
106
+ "<|ć|>",
107
+ "<|Ĉ|>",
108
+ "<|ĉ|>",
109
+ "<|Ċ|>",
110
+ "<|ċ|>",
111
+ "<|Č|>",
112
+ "<|č|>",
113
+ "<|Ď|>",
114
+ "<|ď|>",
115
+ "<|Đ|>",
116
+ "<|đ|>",
117
+ "<|Ē|>",
118
+ "<|ē|>",
119
+ "<|Ĕ|>",
120
+ "<|ĕ|>",
121
+ "<|Ė|>",
122
+ "<|ė|>",
123
+ "<|Ę|>",
124
+ "<|ę|>",
125
+ "<|Ě|>",
126
+ "<|ě|>",
127
+ "<|Ĝ|>",
128
+ "<|ĝ|>",
129
+ "<|Ğ|>",
130
+ "<|ğ|>",
131
+ "<|Ġ|>",
132
+ "<|ġ|>",
133
+ "<|Ģ|>",
134
+ "<|ģ|>",
135
+ "<|Ĥ|>",
136
+ "<|ĥ|>",
137
+ "<|Ħ|>",
138
+ "<|ħ|>",
139
+ "<|Ĩ|>",
140
+ "<|ĩ|>",
141
+ "<|Ī|>",
142
+ "<|ī|>",
143
+ "<|Ĭ|>",
144
+ "<|ĭ|>",
145
+ "<|Į|>",
146
+ "<|į|>",
147
+ "<|İ|>",
148
+ "<|ı|>",
149
+ "<|IJ|>",
150
+ "<|ij|>",
151
+ "<|Ĵ|>",
152
+ "<|ĵ|>",
153
+ "<|Ķ|>",
154
+ "<|ķ|>",
155
+ "<|ĸ|>",
156
+ "<|Ĺ|>",
157
+ "<|ĺ|>",
158
+ "<|Ļ|>",
159
+ "<|ļ|>",
160
+ "<|Ľ|>",
161
+ "<|ľ|>",
162
+ "<|Ŀ|>",
163
+ "<|ŀ|>",
164
+ "<|Ł|>",
165
+ "<|ł|>",
166
+ "<|Ń|>",
167
+ "<|Ûķ|>",
168
+ "<|ÛĮ|>",
169
+ "<|ا|>",
170
+ "<|ĠØ|>",
171
+ "<|ÙĪ|>",
172
+ "<|ÙĨ|>",
173
+ "<|ر|>",
174
+ "<|Ú©|>",
175
+ "<|ĠÙ|>",
176
+ "<|ت|>",
177
+ "<|اÙĨ|>",
178
+ "<|Ûİ|>",
179
+ "<|د|>",
180
+ "<|Ùħ|>",
181
+ "<|Ġب|>",
182
+ "<|ÛĨ|>",
183
+ "<|س|>",
184
+ "<|translate|>",
185
+ "<|transcribe|>",
186
+ "<|startoflm|>",
187
+ "<|startofprev|>",
188
+ "<|nocaptions|>",
189
+ "<|notimestamps|>"
190
+ ],
191
+ "bos_token": {
192
+ "content": "<|endoftext|>",
193
+ "lstrip": false,
194
+ "normalized": true,
195
+ "rstrip": false,
196
+ "single_word": false
197
+ },
198
+ "cls_token": "[CLS]",
199
+ "eos_token": {
200
+ "content": "<|endoftext|>",
201
+ "lstrip": false,
202
+ "normalized": true,
203
+ "rstrip": false,
204
+ "single_word": false
205
+ },
206
+ "mask_token": "[MASK]",
207
+ "pad_token": "<|endoftext|>",
208
+ "sep_token": "[SEP]",
209
+ "unk_token": {
210
+ "content": "<|endoftext|>",
211
+ "lstrip": false,
212
+ "normalized": true,
213
+ "rstrip": false,
214
+ "single_word": false
215
+ }
216
+ }
bert/tokenizer_config.json ADDED
@@ -0,0 +1,2301 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "2": {
6
+ "content": "[CLS]",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "3": {
14
+ "content": "[SEP]",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "4": {
22
+ "content": "[MASK]",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "50106": {
30
+ "content": "<|endoftext|>",
31
+ "lstrip": false,
32
+ "normalized": true,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "50107": {
38
+ "content": "<|¡|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "50108": {
46
+ "content": "<|¢|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "50109": {
54
+ "content": "<|£|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "50110": {
62
+ "content": "<|¤|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "50111": {
70
+ "content": "<|¥|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "50112": {
78
+ "content": "<|¦|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "50113": {
86
+ "content": "<|§|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "50114": {
94
+ "content": "<|¨|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "50115": {
102
+ "content": "<|©|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "50116": {
110
+ "content": "<|ª|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "50117": {
118
+ "content": "<|«|>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": true
124
+ },
125
+ "50118": {
126
+ "content": "<|¬|>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": true
132
+ },
133
+ "50119": {
134
+ "content": "<|®|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": true
140
+ },
141
+ "50120": {
142
+ "content": "<|¯|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": true
148
+ },
149
+ "50121": {
150
+ "content": "<|°|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": true
156
+ },
157
+ "50122": {
158
+ "content": "<|±|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": true
164
+ },
165
+ "50123": {
166
+ "content": "<|²|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": true
172
+ },
173
+ "50124": {
174
+ "content": "<|³|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": true
180
+ },
181
+ "50125": {
182
+ "content": "<|´|>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": true
188
+ },
189
+ "50126": {
190
+ "content": "<|µ|>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": true
196
+ },
197
+ "50127": {
198
+ "content": "<|¶|>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": true
204
+ },
205
+ "50128": {
206
+ "content": "<|·|>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": true
212
+ },
213
+ "50129": {
214
+ "content": "<|¸|>",
215
+ "lstrip": false,
216
+ "normalized": false,
217
+ "rstrip": false,
218
+ "single_word": false,
219
+ "special": true
220
+ },
221
+ "50130": {
222
+ "content": "<|¹|>",
223
+ "lstrip": false,
224
+ "normalized": false,
225
+ "rstrip": false,
226
+ "single_word": false,
227
+ "special": true
228
+ },
229
+ "50131": {
230
+ "content": "<|º|>",
231
+ "lstrip": false,
232
+ "normalized": false,
233
+ "rstrip": false,
234
+ "single_word": false,
235
+ "special": true
236
+ },
237
+ "50132": {
238
+ "content": "<|»|>",
239
+ "lstrip": false,
240
+ "normalized": false,
241
+ "rstrip": false,
242
+ "single_word": false,
243
+ "special": true
244
+ },
245
+ "50133": {
246
+ "content": "<|¼|>",
247
+ "lstrip": false,
248
+ "normalized": false,
249
+ "rstrip": false,
250
+ "single_word": false,
251
+ "special": true
252
+ },
253
+ "50134": {
254
+ "content": "<|½|>",
255
+ "lstrip": false,
256
+ "normalized": false,
257
+ "rstrip": false,
258
+ "single_word": false,
259
+ "special": true
260
+ },
261
+ "50135": {
262
+ "content": "<|¾|>",
263
+ "lstrip": false,
264
+ "normalized": false,
265
+ "rstrip": false,
266
+ "single_word": false,
267
+ "special": true
268
+ },
269
+ "50136": {
270
+ "content": "<|¿|>",
271
+ "lstrip": false,
272
+ "normalized": false,
273
+ "rstrip": false,
274
+ "single_word": false,
275
+ "special": true
276
+ },
277
+ "50137": {
278
+ "content": "<|À|>",
279
+ "lstrip": false,
280
+ "normalized": false,
281
+ "rstrip": false,
282
+ "single_word": false,
283
+ "special": true
284
+ },
285
+ "50138": {
286
+ "content": "<|Á|>",
287
+ "lstrip": false,
288
+ "normalized": false,
289
+ "rstrip": false,
290
+ "single_word": false,
291
+ "special": true
292
+ },
293
+ "50139": {
294
+ "content": "<|Â|>",
295
+ "lstrip": false,
296
+ "normalized": false,
297
+ "rstrip": false,
298
+ "single_word": false,
299
+ "special": true
300
+ },
301
+ "50140": {
302
+ "content": "<|Ã|>",
303
+ "lstrip": false,
304
+ "normalized": false,
305
+ "rstrip": false,
306
+ "single_word": false,
307
+ "special": true
308
+ },
309
+ "50141": {
310
+ "content": "<|Ä|>",
311
+ "lstrip": false,
312
+ "normalized": false,
313
+ "rstrip": false,
314
+ "single_word": false,
315
+ "special": true
316
+ },
317
+ "50142": {
318
+ "content": "<|Å|>",
319
+ "lstrip": false,
320
+ "normalized": false,
321
+ "rstrip": false,
322
+ "single_word": false,
323
+ "special": true
324
+ },
325
+ "50143": {
326
+ "content": "<|Æ|>",
327
+ "lstrip": false,
328
+ "normalized": false,
329
+ "rstrip": false,
330
+ "single_word": false,
331
+ "special": true
332
+ },
333
+ "50144": {
334
+ "content": "<|Ç|>",
335
+ "lstrip": false,
336
+ "normalized": false,
337
+ "rstrip": false,
338
+ "single_word": false,
339
+ "special": true
340
+ },
341
+ "50145": {
342
+ "content": "<|È|>",
343
+ "lstrip": false,
344
+ "normalized": false,
345
+ "rstrip": false,
346
+ "single_word": false,
347
+ "special": true
348
+ },
349
+ "50146": {
350
+ "content": "<|É|>",
351
+ "lstrip": false,
352
+ "normalized": false,
353
+ "rstrip": false,
354
+ "single_word": false,
355
+ "special": true
356
+ },
357
+ "50147": {
358
+ "content": "<|Ê|>",
359
+ "lstrip": false,
360
+ "normalized": false,
361
+ "rstrip": false,
362
+ "single_word": false,
363
+ "special": true
364
+ },
365
+ "50148": {
366
+ "content": "<|Ë|>",
367
+ "lstrip": false,
368
+ "normalized": false,
369
+ "rstrip": false,
370
+ "single_word": false,
371
+ "special": true
372
+ },
373
+ "50149": {
374
+ "content": "<|Ì|>",
375
+ "lstrip": false,
376
+ "normalized": false,
377
+ "rstrip": false,
378
+ "single_word": false,
379
+ "special": true
380
+ },
381
+ "50150": {
382
+ "content": "<|Í|>",
383
+ "lstrip": false,
384
+ "normalized": false,
385
+ "rstrip": false,
386
+ "single_word": false,
387
+ "special": true
388
+ },
389
+ "50151": {
390
+ "content": "<|Î|>",
391
+ "lstrip": false,
392
+ "normalized": false,
393
+ "rstrip": false,
394
+ "single_word": false,
395
+ "special": true
396
+ },
397
+ "50152": {
398
+ "content": "<|Ï|>",
399
+ "lstrip": false,
400
+ "normalized": false,
401
+ "rstrip": false,
402
+ "single_word": false,
403
+ "special": true
404
+ },
405
+ "50153": {
406
+ "content": "<|Ð|>",
407
+ "lstrip": false,
408
+ "normalized": false,
409
+ "rstrip": false,
410
+ "single_word": false,
411
+ "special": true
412
+ },
413
+ "50154": {
414
+ "content": "<|Ñ|>",
415
+ "lstrip": false,
416
+ "normalized": false,
417
+ "rstrip": false,
418
+ "single_word": false,
419
+ "special": true
420
+ },
421
+ "50155": {
422
+ "content": "<|Ò|>",
423
+ "lstrip": false,
424
+ "normalized": false,
425
+ "rstrip": false,
426
+ "single_word": false,
427
+ "special": true
428
+ },
429
+ "50156": {
430
+ "content": "<|Ó|>",
431
+ "lstrip": false,
432
+ "normalized": false,
433
+ "rstrip": false,
434
+ "single_word": false,
435
+ "special": true
436
+ },
437
+ "50157": {
438
+ "content": "<|Ô|>",
439
+ "lstrip": false,
440
+ "normalized": false,
441
+ "rstrip": false,
442
+ "single_word": false,
443
+ "special": true
444
+ },
445
+ "50158": {
446
+ "content": "<|Õ|>",
447
+ "lstrip": false,
448
+ "normalized": false,
449
+ "rstrip": false,
450
+ "single_word": false,
451
+ "special": true
452
+ },
453
+ "50159": {
454
+ "content": "<|Ö|>",
455
+ "lstrip": false,
456
+ "normalized": false,
457
+ "rstrip": false,
458
+ "single_word": false,
459
+ "special": true
460
+ },
461
+ "50160": {
462
+ "content": "<|×|>",
463
+ "lstrip": false,
464
+ "normalized": false,
465
+ "rstrip": false,
466
+ "single_word": false,
467
+ "special": true
468
+ },
469
+ "50161": {
470
+ "content": "<|Ø|>",
471
+ "lstrip": false,
472
+ "normalized": false,
473
+ "rstrip": false,
474
+ "single_word": false,
475
+ "special": true
476
+ },
477
+ "50162": {
478
+ "content": "<|Ù|>",
479
+ "lstrip": false,
480
+ "normalized": false,
481
+ "rstrip": false,
482
+ "single_word": false,
483
+ "special": true
484
+ },
485
+ "50163": {
486
+ "content": "<|Ú|>",
487
+ "lstrip": false,
488
+ "normalized": false,
489
+ "rstrip": false,
490
+ "single_word": false,
491
+ "special": true
492
+ },
493
+ "50164": {
494
+ "content": "<|Û|>",
495
+ "lstrip": false,
496
+ "normalized": false,
497
+ "rstrip": false,
498
+ "single_word": false,
499
+ "special": true
500
+ },
501
+ "50165": {
502
+ "content": "<|Ü|>",
503
+ "lstrip": false,
504
+ "normalized": false,
505
+ "rstrip": false,
506
+ "single_word": false,
507
+ "special": true
508
+ },
509
+ "50166": {
510
+ "content": "<|Ý|>",
511
+ "lstrip": false,
512
+ "normalized": false,
513
+ "rstrip": false,
514
+ "single_word": false,
515
+ "special": true
516
+ },
517
+ "50167": {
518
+ "content": "<|Þ|>",
519
+ "lstrip": false,
520
+ "normalized": false,
521
+ "rstrip": false,
522
+ "single_word": false,
523
+ "special": true
524
+ },
525
+ "50168": {
526
+ "content": "<|ß|>",
527
+ "lstrip": false,
528
+ "normalized": false,
529
+ "rstrip": false,
530
+ "single_word": false,
531
+ "special": true
532
+ },
533
+ "50169": {
534
+ "content": "<|à|>",
535
+ "lstrip": false,
536
+ "normalized": false,
537
+ "rstrip": false,
538
+ "single_word": false,
539
+ "special": true
540
+ },
541
+ "50170": {
542
+ "content": "<|á|>",
543
+ "lstrip": false,
544
+ "normalized": false,
545
+ "rstrip": false,
546
+ "single_word": false,
547
+ "special": true
548
+ },
549
+ "50171": {
550
+ "content": "<|â|>",
551
+ "lstrip": false,
552
+ "normalized": false,
553
+ "rstrip": false,
554
+ "single_word": false,
555
+ "special": true
556
+ },
557
+ "50172": {
558
+ "content": "<|ã|>",
559
+ "lstrip": false,
560
+ "normalized": false,
561
+ "rstrip": false,
562
+ "single_word": false,
563
+ "special": true
564
+ },
565
+ "50173": {
566
+ "content": "<|ä|>",
567
+ "lstrip": false,
568
+ "normalized": false,
569
+ "rstrip": false,
570
+ "single_word": false,
571
+ "special": true
572
+ },
573
+ "50174": {
574
+ "content": "<|å|>",
575
+ "lstrip": false,
576
+ "normalized": false,
577
+ "rstrip": false,
578
+ "single_word": false,
579
+ "special": true
580
+ },
581
+ "50175": {
582
+ "content": "<|æ|>",
583
+ "lstrip": false,
584
+ "normalized": false,
585
+ "rstrip": false,
586
+ "single_word": false,
587
+ "special": true
588
+ },
589
+ "50176": {
590
+ "content": "<|ç|>",
591
+ "lstrip": false,
592
+ "normalized": false,
593
+ "rstrip": false,
594
+ "single_word": false,
595
+ "special": true
596
+ },
597
+ "50177": {
598
+ "content": "<|è|>",
599
+ "lstrip": false,
600
+ "normalized": false,
601
+ "rstrip": false,
602
+ "single_word": false,
603
+ "special": true
604
+ },
605
+ "50178": {
606
+ "content": "<|é|>",
607
+ "lstrip": false,
608
+ "normalized": false,
609
+ "rstrip": false,
610
+ "single_word": false,
611
+ "special": true
612
+ },
613
+ "50179": {
614
+ "content": "<|ê|>",
615
+ "lstrip": false,
616
+ "normalized": false,
617
+ "rstrip": false,
618
+ "single_word": false,
619
+ "special": true
620
+ },
621
+ "50180": {
622
+ "content": "<|ë|>",
623
+ "lstrip": false,
624
+ "normalized": false,
625
+ "rstrip": false,
626
+ "single_word": false,
627
+ "special": true
628
+ },
629
+ "50181": {
630
+ "content": "<|ì|>",
631
+ "lstrip": false,
632
+ "normalized": false,
633
+ "rstrip": false,
634
+ "single_word": false,
635
+ "special": true
636
+ },
637
+ "50182": {
638
+ "content": "<|í|>",
639
+ "lstrip": false,
640
+ "normalized": false,
641
+ "rstrip": false,
642
+ "single_word": false,
643
+ "special": true
644
+ },
645
+ "50183": {
646
+ "content": "<|î|>",
647
+ "lstrip": false,
648
+ "normalized": false,
649
+ "rstrip": false,
650
+ "single_word": false,
651
+ "special": true
652
+ },
653
+ "50184": {
654
+ "content": "<|ï|>",
655
+ "lstrip": false,
656
+ "normalized": false,
657
+ "rstrip": false,
658
+ "single_word": false,
659
+ "special": true
660
+ },
661
+ "50185": {
662
+ "content": "<|ð|>",
663
+ "lstrip": false,
664
+ "normalized": false,
665
+ "rstrip": false,
666
+ "single_word": false,
667
+ "special": true
668
+ },
669
+ "50186": {
670
+ "content": "<|ñ|>",
671
+ "lstrip": false,
672
+ "normalized": false,
673
+ "rstrip": false,
674
+ "single_word": false,
675
+ "special": true
676
+ },
677
+ "50187": {
678
+ "content": "<|ò|>",
679
+ "lstrip": false,
680
+ "normalized": false,
681
+ "rstrip": false,
682
+ "single_word": false,
683
+ "special": true
684
+ },
685
+ "50188": {
686
+ "content": "<|ó|>",
687
+ "lstrip": false,
688
+ "normalized": false,
689
+ "rstrip": false,
690
+ "single_word": false,
691
+ "special": true
692
+ },
693
+ "50189": {
694
+ "content": "<|ô|>",
695
+ "lstrip": false,
696
+ "normalized": false,
697
+ "rstrip": false,
698
+ "single_word": false,
699
+ "special": true
700
+ },
701
+ "50190": {
702
+ "content": "<|õ|>",
703
+ "lstrip": false,
704
+ "normalized": false,
705
+ "rstrip": false,
706
+ "single_word": false,
707
+ "special": true
708
+ },
709
+ "50191": {
710
+ "content": "<|ö|>",
711
+ "lstrip": false,
712
+ "normalized": false,
713
+ "rstrip": false,
714
+ "single_word": false,
715
+ "special": true
716
+ },
717
+ "50192": {
718
+ "content": "<|÷|>",
719
+ "lstrip": false,
720
+ "normalized": false,
721
+ "rstrip": false,
722
+ "single_word": false,
723
+ "special": true
724
+ },
725
+ "50193": {
726
+ "content": "<|ø|>",
727
+ "lstrip": false,
728
+ "normalized": false,
729
+ "rstrip": false,
730
+ "single_word": false,
731
+ "special": true
732
+ },
733
+ "50194": {
734
+ "content": "<|ù|>",
735
+ "lstrip": false,
736
+ "normalized": false,
737
+ "rstrip": false,
738
+ "single_word": false,
739
+ "special": true
740
+ },
741
+ "50195": {
742
+ "content": "<|ú|>",
743
+ "lstrip": false,
744
+ "normalized": false,
745
+ "rstrip": false,
746
+ "single_word": false,
747
+ "special": true
748
+ },
749
+ "50196": {
750
+ "content": "<|û|>",
751
+ "lstrip": false,
752
+ "normalized": false,
753
+ "rstrip": false,
754
+ "single_word": false,
755
+ "special": true
756
+ },
757
+ "50197": {
758
+ "content": "<|ü|>",
759
+ "lstrip": false,
760
+ "normalized": false,
761
+ "rstrip": false,
762
+ "single_word": false,
763
+ "special": true
764
+ },
765
+ "50198": {
766
+ "content": "<|ý|>",
767
+ "lstrip": false,
768
+ "normalized": false,
769
+ "rstrip": false,
770
+ "single_word": false,
771
+ "special": true
772
+ },
773
+ "50199": {
774
+ "content": "<|þ|>",
775
+ "lstrip": false,
776
+ "normalized": false,
777
+ "rstrip": false,
778
+ "single_word": false,
779
+ "special": true
780
+ },
781
+ "50200": {
782
+ "content": "<|ÿ|>",
783
+ "lstrip": false,
784
+ "normalized": false,
785
+ "rstrip": false,
786
+ "single_word": false,
787
+ "special": true
788
+ },
789
+ "50201": {
790
+ "content": "<|Ā|>",
791
+ "lstrip": false,
792
+ "normalized": false,
793
+ "rstrip": false,
794
+ "single_word": false,
795
+ "special": true
796
+ },
797
+ "50202": {
798
+ "content": "<|ā|>",
799
+ "lstrip": false,
800
+ "normalized": false,
801
+ "rstrip": false,
802
+ "single_word": false,
803
+ "special": true
804
+ },
805
+ "50203": {
806
+ "content": "<|Ă|>",
807
+ "lstrip": false,
808
+ "normalized": false,
809
+ "rstrip": false,
810
+ "single_word": false,
811
+ "special": true
812
+ },
813
+ "50204": {
814
+ "content": "<|ă|>",
815
+ "lstrip": false,
816
+ "normalized": false,
817
+ "rstrip": false,
818
+ "single_word": false,
819
+ "special": true
820
+ },
821
+ "50205": {
822
+ "content": "<|Ą|>",
823
+ "lstrip": false,
824
+ "normalized": false,
825
+ "rstrip": false,
826
+ "single_word": false,
827
+ "special": true
828
+ },
829
+ "50206": {
830
+ "content": "<|ą|>",
831
+ "lstrip": false,
832
+ "normalized": false,
833
+ "rstrip": false,
834
+ "single_word": false,
835
+ "special": true
836
+ },
837
+ "50207": {
838
+ "content": "<|Ć|>",
839
+ "lstrip": false,
840
+ "normalized": false,
841
+ "rstrip": false,
842
+ "single_word": false,
843
+ "special": true
844
+ },
845
+ "50208": {
846
+ "content": "<|ć|>",
847
+ "lstrip": false,
848
+ "normalized": false,
849
+ "rstrip": false,
850
+ "single_word": false,
851
+ "special": true
852
+ },
853
+ "50209": {
854
+ "content": "<|Ĉ|>",
855
+ "lstrip": false,
856
+ "normalized": false,
857
+ "rstrip": false,
858
+ "single_word": false,
859
+ "special": true
860
+ },
861
+ "50210": {
862
+ "content": "<|ĉ|>",
863
+ "lstrip": false,
864
+ "normalized": false,
865
+ "rstrip": false,
866
+ "single_word": false,
867
+ "special": true
868
+ },
869
+ "50211": {
870
+ "content": "<|Ċ|>",
871
+ "lstrip": false,
872
+ "normalized": false,
873
+ "rstrip": false,
874
+ "single_word": false,
875
+ "special": true
876
+ },
877
+ "50212": {
878
+ "content": "<|ċ|>",
879
+ "lstrip": false,
880
+ "normalized": false,
881
+ "rstrip": false,
882
+ "single_word": false,
883
+ "special": true
884
+ },
885
+ "50213": {
886
+ "content": "<|Č|>",
887
+ "lstrip": false,
888
+ "normalized": false,
889
+ "rstrip": false,
890
+ "single_word": false,
891
+ "special": true
892
+ },
893
+ "50214": {
894
+ "content": "<|č|>",
895
+ "lstrip": false,
896
+ "normalized": false,
897
+ "rstrip": false,
898
+ "single_word": false,
899
+ "special": true
900
+ },
901
+ "50215": {
902
+ "content": "<|Ď|>",
903
+ "lstrip": false,
904
+ "normalized": false,
905
+ "rstrip": false,
906
+ "single_word": false,
907
+ "special": true
908
+ },
909
+ "50216": {
910
+ "content": "<|ď|>",
911
+ "lstrip": false,
912
+ "normalized": false,
913
+ "rstrip": false,
914
+ "single_word": false,
915
+ "special": true
916
+ },
917
+ "50217": {
918
+ "content": "<|Đ|>",
919
+ "lstrip": false,
920
+ "normalized": false,
921
+ "rstrip": false,
922
+ "single_word": false,
923
+ "special": true
924
+ },
925
+ "50218": {
926
+ "content": "<|đ|>",
927
+ "lstrip": false,
928
+ "normalized": false,
929
+ "rstrip": false,
930
+ "single_word": false,
931
+ "special": true
932
+ },
933
+ "50219": {
934
+ "content": "<|Ē|>",
935
+ "lstrip": false,
936
+ "normalized": false,
937
+ "rstrip": false,
938
+ "single_word": false,
939
+ "special": true
940
+ },
941
+ "50220": {
942
+ "content": "<|ē|>",
943
+ "lstrip": false,
944
+ "normalized": false,
945
+ "rstrip": false,
946
+ "single_word": false,
947
+ "special": true
948
+ },
949
+ "50221": {
950
+ "content": "<|Ĕ|>",
951
+ "lstrip": false,
952
+ "normalized": false,
953
+ "rstrip": false,
954
+ "single_word": false,
955
+ "special": true
956
+ },
957
+ "50222": {
958
+ "content": "<|ĕ|>",
959
+ "lstrip": false,
960
+ "normalized": false,
961
+ "rstrip": false,
962
+ "single_word": false,
963
+ "special": true
964
+ },
965
+ "50223": {
966
+ "content": "<|Ė|>",
967
+ "lstrip": false,
968
+ "normalized": false,
969
+ "rstrip": false,
970
+ "single_word": false,
971
+ "special": true
972
+ },
973
+ "50224": {
974
+ "content": "<|ė|>",
975
+ "lstrip": false,
976
+ "normalized": false,
977
+ "rstrip": false,
978
+ "single_word": false,
979
+ "special": true
980
+ },
981
+ "50225": {
982
+ "content": "<|Ę|>",
983
+ "lstrip": false,
984
+ "normalized": false,
985
+ "rstrip": false,
986
+ "single_word": false,
987
+ "special": true
988
+ },
989
+ "50226": {
990
+ "content": "<|ę|>",
991
+ "lstrip": false,
992
+ "normalized": false,
993
+ "rstrip": false,
994
+ "single_word": false,
995
+ "special": true
996
+ },
997
+ "50227": {
998
+ "content": "<|Ě|>",
999
+ "lstrip": false,
1000
+ "normalized": false,
1001
+ "rstrip": false,
1002
+ "single_word": false,
1003
+ "special": true
1004
+ },
1005
+ "50228": {
1006
+ "content": "<|ě|>",
1007
+ "lstrip": false,
1008
+ "normalized": false,
1009
+ "rstrip": false,
1010
+ "single_word": false,
1011
+ "special": true
1012
+ },
1013
+ "50229": {
1014
+ "content": "<|Ĝ|>",
1015
+ "lstrip": false,
1016
+ "normalized": false,
1017
+ "rstrip": false,
1018
+ "single_word": false,
1019
+ "special": true
1020
+ },
1021
+ "50230": {
1022
+ "content": "<|ĝ|>",
1023
+ "lstrip": false,
1024
+ "normalized": false,
1025
+ "rstrip": false,
1026
+ "single_word": false,
1027
+ "special": true
1028
+ },
1029
+ "50231": {
1030
+ "content": "<|Ğ|>",
1031
+ "lstrip": false,
1032
+ "normalized": false,
1033
+ "rstrip": false,
1034
+ "single_word": false,
1035
+ "special": true
1036
+ },
1037
+ "50232": {
1038
+ "content": "<|ğ|>",
1039
+ "lstrip": false,
1040
+ "normalized": false,
1041
+ "rstrip": false,
1042
+ "single_word": false,
1043
+ "special": true
1044
+ },
1045
+ "50233": {
1046
+ "content": "<|Ġ|>",
1047
+ "lstrip": false,
1048
+ "normalized": false,
1049
+ "rstrip": false,
1050
+ "single_word": false,
1051
+ "special": true
1052
+ },
1053
+ "50234": {
1054
+ "content": "<|ġ|>",
1055
+ "lstrip": false,
1056
+ "normalized": false,
1057
+ "rstrip": false,
1058
+ "single_word": false,
1059
+ "special": true
1060
+ },
1061
+ "50235": {
1062
+ "content": "<|Ģ|>",
1063
+ "lstrip": false,
1064
+ "normalized": false,
1065
+ "rstrip": false,
1066
+ "single_word": false,
1067
+ "special": true
1068
+ },
1069
+ "50236": {
1070
+ "content": "<|ģ|>",
1071
+ "lstrip": false,
1072
+ "normalized": false,
1073
+ "rstrip": false,
1074
+ "single_word": false,
1075
+ "special": true
1076
+ },
1077
+ "50237": {
1078
+ "content": "<|Ĥ|>",
1079
+ "lstrip": false,
1080
+ "normalized": false,
1081
+ "rstrip": false,
1082
+ "single_word": false,
1083
+ "special": true
1084
+ },
1085
+ "50238": {
1086
+ "content": "<|ĥ|>",
1087
+ "lstrip": false,
1088
+ "normalized": false,
1089
+ "rstrip": false,
1090
+ "single_word": false,
1091
+ "special": true
1092
+ },
1093
+ "50239": {
1094
+ "content": "<|Ħ|>",
1095
+ "lstrip": false,
1096
+ "normalized": false,
1097
+ "rstrip": false,
1098
+ "single_word": false,
1099
+ "special": true
1100
+ },
1101
+ "50240": {
1102
+ "content": "<|ħ|>",
1103
+ "lstrip": false,
1104
+ "normalized": false,
1105
+ "rstrip": false,
1106
+ "single_word": false,
1107
+ "special": true
1108
+ },
1109
+ "50241": {
1110
+ "content": "<|Ĩ|>",
1111
+ "lstrip": false,
1112
+ "normalized": false,
1113
+ "rstrip": false,
1114
+ "single_word": false,
1115
+ "special": true
1116
+ },
1117
+ "50242": {
1118
+ "content": "<|ĩ|>",
1119
+ "lstrip": false,
1120
+ "normalized": false,
1121
+ "rstrip": false,
1122
+ "single_word": false,
1123
+ "special": true
1124
+ },
1125
+ "50243": {
1126
+ "content": "<|Ī|>",
1127
+ "lstrip": false,
1128
+ "normalized": false,
1129
+ "rstrip": false,
1130
+ "single_word": false,
1131
+ "special": true
1132
+ },
1133
+ "50244": {
1134
+ "content": "<|ī|>",
1135
+ "lstrip": false,
1136
+ "normalized": false,
1137
+ "rstrip": false,
1138
+ "single_word": false,
1139
+ "special": true
1140
+ },
1141
+ "50245": {
1142
+ "content": "<|Ĭ|>",
1143
+ "lstrip": false,
1144
+ "normalized": false,
1145
+ "rstrip": false,
1146
+ "single_word": false,
1147
+ "special": true
1148
+ },
1149
+ "50246": {
1150
+ "content": "<|ĭ|>",
1151
+ "lstrip": false,
1152
+ "normalized": false,
1153
+ "rstrip": false,
1154
+ "single_word": false,
1155
+ "special": true
1156
+ },
1157
+ "50247": {
1158
+ "content": "<|Į|>",
1159
+ "lstrip": false,
1160
+ "normalized": false,
1161
+ "rstrip": false,
1162
+ "single_word": false,
1163
+ "special": true
1164
+ },
1165
+ "50248": {
1166
+ "content": "<|į|>",
1167
+ "lstrip": false,
1168
+ "normalized": false,
1169
+ "rstrip": false,
1170
+ "single_word": false,
1171
+ "special": true
1172
+ },
1173
+ "50249": {
1174
+ "content": "<|İ|>",
1175
+ "lstrip": false,
1176
+ "normalized": false,
1177
+ "rstrip": false,
1178
+ "single_word": false,
1179
+ "special": true
1180
+ },
1181
+ "50250": {
1182
+ "content": "<|ı|>",
1183
+ "lstrip": false,
1184
+ "normalized": false,
1185
+ "rstrip": false,
1186
+ "single_word": false,
1187
+ "special": true
1188
+ },
1189
+ "50251": {
1190
+ "content": "<|Ĳ|>",
1191
+ "lstrip": false,
1192
+ "normalized": false,
1193
+ "rstrip": false,
1194
+ "single_word": false,
1195
+ "special": true
1196
+ },
1197
+ "50252": {
1198
+ "content": "<|ĳ|>",
1199
+ "lstrip": false,
1200
+ "normalized": false,
1201
+ "rstrip": false,
1202
+ "single_word": false,
1203
+ "special": true
1204
+ },
1205
+ "50253": {
1206
+ "content": "<|Ĵ|>",
1207
+ "lstrip": false,
1208
+ "normalized": false,
1209
+ "rstrip": false,
1210
+ "single_word": false,
1211
+ "special": true
1212
+ },
1213
+ "50254": {
1214
+ "content": "<|ĵ|>",
1215
+ "lstrip": false,
1216
+ "normalized": false,
1217
+ "rstrip": false,
1218
+ "single_word": false,
1219
+ "special": true
1220
+ },
1221
+ "50255": {
1222
+ "content": "<|Ķ|>",
1223
+ "lstrip": false,
1224
+ "normalized": false,
1225
+ "rstrip": false,
1226
+ "single_word": false,
1227
+ "special": true
1228
+ },
1229
+ "50256": {
1230
+ "content": "<|ķ|>",
1231
+ "lstrip": false,
1232
+ "normalized": false,
1233
+ "rstrip": false,
1234
+ "single_word": false,
1235
+ "special": true
1236
+ },
1237
+ "50257": {
1238
+ "content": "<|ĸ|>",
1239
+ "lstrip": false,
1240
+ "normalized": false,
1241
+ "rstrip": false,
1242
+ "single_word": false,
1243
+ "special": true
1244
+ },
1245
+ "50258": {
1246
+ "content": "<|Ĺ|>",
1247
+ "lstrip": false,
1248
+ "normalized": false,
1249
+ "rstrip": false,
1250
+ "single_word": false,
1251
+ "special": true
1252
+ },
1253
+ "50259": {
1254
+ "content": "<|ĺ|>",
1255
+ "lstrip": false,
1256
+ "normalized": false,
1257
+ "rstrip": false,
1258
+ "single_word": false,
1259
+ "special": true
1260
+ },
1261
+ "50260": {
1262
+ "content": "<|Ļ|>",
1263
+ "lstrip": false,
1264
+ "normalized": false,
1265
+ "rstrip": false,
1266
+ "single_word": false,
1267
+ "special": true
1268
+ },
1269
+ "50261": {
1270
+ "content": "<|ļ|>",
1271
+ "lstrip": false,
1272
+ "normalized": false,
1273
+ "rstrip": false,
1274
+ "single_word": false,
1275
+ "special": true
1276
+ },
1277
+ "50262": {
1278
+ "content": "<|Ľ|>",
1279
+ "lstrip": false,
1280
+ "normalized": false,
1281
+ "rstrip": false,
1282
+ "single_word": false,
1283
+ "special": true
1284
+ },
1285
+ "50263": {
1286
+ "content": "<|ľ|>",
1287
+ "lstrip": false,
1288
+ "normalized": false,
1289
+ "rstrip": false,
1290
+ "single_word": false,
1291
+ "special": true
1292
+ },
1293
+ "50264": {
1294
+ "content": "<|Ŀ|>",
1295
+ "lstrip": false,
1296
+ "normalized": false,
1297
+ "rstrip": false,
1298
+ "single_word": false,
1299
+ "special": true
1300
+ },
1301
+ "50265": {
1302
+ "content": "<|ŀ|>",
1303
+ "lstrip": false,
1304
+ "normalized": false,
1305
+ "rstrip": false,
1306
+ "single_word": false,
1307
+ "special": true
1308
+ },
1309
+ "50266": {
1310
+ "content": "<|Ł|>",
1311
+ "lstrip": false,
1312
+ "normalized": false,
1313
+ "rstrip": false,
1314
+ "single_word": false,
1315
+ "special": true
1316
+ },
1317
+ "50267": {
1318
+ "content": "<|ł|>",
1319
+ "lstrip": false,
1320
+ "normalized": false,
1321
+ "rstrip": false,
1322
+ "single_word": false,
1323
+ "special": true
1324
+ },
1325
+ "50268": {
1326
+ "content": "<|Ń|>",
1327
+ "lstrip": false,
1328
+ "normalized": false,
1329
+ "rstrip": false,
1330
+ "single_word": false,
1331
+ "special": true
1332
+ },
1333
+ "50269": {
1334
+ "content": "<|Ûķ|>",
1335
+ "lstrip": false,
1336
+ "normalized": false,
1337
+ "rstrip": false,
1338
+ "single_word": false,
1339
+ "special": true
1340
+ },
1341
+ "50270": {
1342
+ "content": "<|ÛĮ|>",
1343
+ "lstrip": false,
1344
+ "normalized": false,
1345
+ "rstrip": false,
1346
+ "single_word": false,
1347
+ "special": true
1348
+ },
1349
+ "50271": {
1350
+ "content": "<|ا|>",
1351
+ "lstrip": false,
1352
+ "normalized": false,
1353
+ "rstrip": false,
1354
+ "single_word": false,
1355
+ "special": true
1356
+ },
1357
+ "50272": {
1358
+ "content": "<|ĠØ|>",
1359
+ "lstrip": false,
1360
+ "normalized": false,
1361
+ "rstrip": false,
1362
+ "single_word": false,
1363
+ "special": true
1364
+ },
1365
+ "50273": {
1366
+ "content": "<|ÙĪ|>",
1367
+ "lstrip": false,
1368
+ "normalized": false,
1369
+ "rstrip": false,
1370
+ "single_word": false,
1371
+ "special": true
1372
+ },
1373
+ "50274": {
1374
+ "content": "<|ÙĨ|>",
1375
+ "lstrip": false,
1376
+ "normalized": false,
1377
+ "rstrip": false,
1378
+ "single_word": false,
1379
+ "special": true
1380
+ },
1381
+ "50275": {
1382
+ "content": "<|ر|>",
1383
+ "lstrip": false,
1384
+ "normalized": false,
1385
+ "rstrip": false,
1386
+ "single_word": false,
1387
+ "special": true
1388
+ },
1389
+ "50276": {
1390
+ "content": "<|Ú©|>",
1391
+ "lstrip": false,
1392
+ "normalized": false,
1393
+ "rstrip": false,
1394
+ "single_word": false,
1395
+ "special": true
1396
+ },
1397
+ "50277": {
1398
+ "content": "<|ĠÙ|>",
1399
+ "lstrip": false,
1400
+ "normalized": false,
1401
+ "rstrip": false,
1402
+ "single_word": false,
1403
+ "special": true
1404
+ },
1405
+ "50278": {
1406
+ "content": "<|ت|>",
1407
+ "lstrip": false,
1408
+ "normalized": false,
1409
+ "rstrip": false,
1410
+ "single_word": false,
1411
+ "special": true
1412
+ },
1413
+ "50279": {
1414
+ "content": "<|اÙĨ|>",
1415
+ "lstrip": false,
1416
+ "normalized": false,
1417
+ "rstrip": false,
1418
+ "single_word": false,
1419
+ "special": true
1420
+ },
1421
+ "50280": {
1422
+ "content": "<|Ûİ|>",
1423
+ "lstrip": false,
1424
+ "normalized": false,
1425
+ "rstrip": false,
1426
+ "single_word": false,
1427
+ "special": true
1428
+ },
1429
+ "50281": {
1430
+ "content": "<|د|>",
1431
+ "lstrip": false,
1432
+ "normalized": false,
1433
+ "rstrip": false,
1434
+ "single_word": false,
1435
+ "special": true
1436
+ },
1437
+ "50282": {
1438
+ "content": "<|Ùħ|>",
1439
+ "lstrip": false,
1440
+ "normalized": false,
1441
+ "rstrip": false,
1442
+ "single_word": false,
1443
+ "special": true
1444
+ },
1445
+ "50283": {
1446
+ "content": "<|Ġب|>",
1447
+ "lstrip": false,
1448
+ "normalized": false,
1449
+ "rstrip": false,
1450
+ "single_word": false,
1451
+ "special": true
1452
+ },
1453
+ "50284": {
1454
+ "content": "<|ÛĨ|>",
1455
+ "lstrip": false,
1456
+ "normalized": false,
1457
+ "rstrip": false,
1458
+ "single_word": false,
1459
+ "special": true
1460
+ },
1461
+ "50285": {
1462
+ "content": "<|س|>",
1463
+ "lstrip": false,
1464
+ "normalized": false,
1465
+ "rstrip": false,
1466
+ "single_word": false,
1467
+ "special": true
1468
+ },
1469
+ "50286": {
1470
+ "content": "<|hu|>",
1471
+ "lstrip": false,
1472
+ "normalized": true,
1473
+ "rstrip": false,
1474
+ "single_word": false,
1475
+ "special": false
1476
+ },
1477
+ "50287": {
1478
+ "content": "<|ta|>",
1479
+ "lstrip": false,
1480
+ "normalized": true,
1481
+ "rstrip": false,
1482
+ "single_word": false,
1483
+ "special": false
1484
+ },
1485
+ "50288": {
1486
+ "content": "<|no|>",
1487
+ "lstrip": false,
1488
+ "normalized": true,
1489
+ "rstrip": false,
1490
+ "single_word": false,
1491
+ "special": false
1492
+ },
1493
+ "50289": {
1494
+ "content": "<|th|>",
1495
+ "lstrip": false,
1496
+ "normalized": true,
1497
+ "rstrip": false,
1498
+ "single_word": false,
1499
+ "special": false
1500
+ },
1501
+ "50290": {
1502
+ "content": "<|ur|>",
1503
+ "lstrip": false,
1504
+ "normalized": true,
1505
+ "rstrip": false,
1506
+ "single_word": false,
1507
+ "special": false
1508
+ },
1509
+ "50291": {
1510
+ "content": "<|hr|>",
1511
+ "lstrip": false,
1512
+ "normalized": true,
1513
+ "rstrip": false,
1514
+ "single_word": false,
1515
+ "special": false
1516
+ },
1517
+ "50292": {
1518
+ "content": "<|bg|>",
1519
+ "lstrip": false,
1520
+ "normalized": true,
1521
+ "rstrip": false,
1522
+ "single_word": false,
1523
+ "special": false
1524
+ },
1525
+ "50293": {
1526
+ "content": "<|lt|>",
1527
+ "lstrip": false,
1528
+ "normalized": true,
1529
+ "rstrip": false,
1530
+ "single_word": false,
1531
+ "special": false
1532
+ },
1533
+ "50294": {
1534
+ "content": "<|la|>",
1535
+ "lstrip": false,
1536
+ "normalized": true,
1537
+ "rstrip": false,
1538
+ "single_word": false,
1539
+ "special": false
1540
+ },
1541
+ "50295": {
1542
+ "content": "<|mi|>",
1543
+ "lstrip": false,
1544
+ "normalized": true,
1545
+ "rstrip": false,
1546
+ "single_word": false,
1547
+ "special": false
1548
+ },
1549
+ "50296": {
1550
+ "content": "<|ml|>",
1551
+ "lstrip": false,
1552
+ "normalized": true,
1553
+ "rstrip": false,
1554
+ "single_word": false,
1555
+ "special": false
1556
+ },
1557
+ "50297": {
1558
+ "content": "<|cy|>",
1559
+ "lstrip": false,
1560
+ "normalized": true,
1561
+ "rstrip": false,
1562
+ "single_word": false,
1563
+ "special": false
1564
+ },
1565
+ "50298": {
1566
+ "content": "<|sk|>",
1567
+ "lstrip": false,
1568
+ "normalized": true,
1569
+ "rstrip": false,
1570
+ "single_word": false,
1571
+ "special": false
1572
+ },
1573
+ "50299": {
1574
+ "content": "<|te|>",
1575
+ "lstrip": false,
1576
+ "normalized": true,
1577
+ "rstrip": false,
1578
+ "single_word": false,
1579
+ "special": false
1580
+ },
1581
+ "50300": {
1582
+ "content": "<|fa|>",
1583
+ "lstrip": false,
1584
+ "normalized": true,
1585
+ "rstrip": false,
1586
+ "single_word": false,
1587
+ "special": false
1588
+ },
1589
+ "50301": {
1590
+ "content": "<|lv|>",
1591
+ "lstrip": false,
1592
+ "normalized": true,
1593
+ "rstrip": false,
1594
+ "single_word": false,
1595
+ "special": false
1596
+ },
1597
+ "50302": {
1598
+ "content": "<|bn|>",
1599
+ "lstrip": false,
1600
+ "normalized": true,
1601
+ "rstrip": false,
1602
+ "single_word": false,
1603
+ "special": false
1604
+ },
1605
+ "50303": {
1606
+ "content": "<|sr|>",
1607
+ "lstrip": false,
1608
+ "normalized": true,
1609
+ "rstrip": false,
1610
+ "single_word": false,
1611
+ "special": false
1612
+ },
1613
+ "50304": {
1614
+ "content": "<|az|>",
1615
+ "lstrip": false,
1616
+ "normalized": true,
1617
+ "rstrip": false,
1618
+ "single_word": false,
1619
+ "special": false
1620
+ },
1621
+ "50305": {
1622
+ "content": "<|sl|>",
1623
+ "lstrip": false,
1624
+ "normalized": true,
1625
+ "rstrip": false,
1626
+ "single_word": false,
1627
+ "special": false
1628
+ },
1629
+ "50306": {
1630
+ "content": "<|kn|>",
1631
+ "lstrip": false,
1632
+ "normalized": true,
1633
+ "rstrip": false,
1634
+ "single_word": false,
1635
+ "special": false
1636
+ },
1637
+ "50307": {
1638
+ "content": "<|et|>",
1639
+ "lstrip": false,
1640
+ "normalized": true,
1641
+ "rstrip": false,
1642
+ "single_word": false,
1643
+ "special": false
1644
+ },
1645
+ "50308": {
1646
+ "content": "<|mk|>",
1647
+ "lstrip": false,
1648
+ "normalized": true,
1649
+ "rstrip": false,
1650
+ "single_word": false,
1651
+ "special": false
1652
+ },
1653
+ "50309": {
1654
+ "content": "<|br|>",
1655
+ "lstrip": false,
1656
+ "normalized": true,
1657
+ "rstrip": false,
1658
+ "single_word": false,
1659
+ "special": false
1660
+ },
1661
+ "50310": {
1662
+ "content": "<|eu|>",
1663
+ "lstrip": false,
1664
+ "normalized": true,
1665
+ "rstrip": false,
1666
+ "single_word": false,
1667
+ "special": false
1668
+ },
1669
+ "50311": {
1670
+ "content": "<|is|>",
1671
+ "lstrip": false,
1672
+ "normalized": true,
1673
+ "rstrip": false,
1674
+ "single_word": false,
1675
+ "special": false
1676
+ },
1677
+ "50312": {
1678
+ "content": "<|hy|>",
1679
+ "lstrip": false,
1680
+ "normalized": true,
1681
+ "rstrip": false,
1682
+ "single_word": false,
1683
+ "special": false
1684
+ },
1685
+ "50313": {
1686
+ "content": "<|ne|>",
1687
+ "lstrip": false,
1688
+ "normalized": true,
1689
+ "rstrip": false,
1690
+ "single_word": false,
1691
+ "special": false
1692
+ },
1693
+ "50314": {
1694
+ "content": "<|mn|>",
1695
+ "lstrip": false,
1696
+ "normalized": true,
1697
+ "rstrip": false,
1698
+ "single_word": false,
1699
+ "special": false
1700
+ },
1701
+ "50315": {
1702
+ "content": "<|bs|>",
1703
+ "lstrip": false,
1704
+ "normalized": true,
1705
+ "rstrip": false,
1706
+ "single_word": false,
1707
+ "special": false
1708
+ },
1709
+ "50316": {
1710
+ "content": "<|kk|>",
1711
+ "lstrip": false,
1712
+ "normalized": true,
1713
+ "rstrip": false,
1714
+ "single_word": false,
1715
+ "special": false
1716
+ },
1717
+ "50317": {
1718
+ "content": "<|sq|>",
1719
+ "lstrip": false,
1720
+ "normalized": true,
1721
+ "rstrip": false,
1722
+ "single_word": false,
1723
+ "special": false
1724
+ },
1725
+ "50318": {
1726
+ "content": "<|sw|>",
1727
+ "lstrip": false,
1728
+ "normalized": true,
1729
+ "rstrip": false,
1730
+ "single_word": false,
1731
+ "special": false
1732
+ },
1733
+ "50319": {
1734
+ "content": "<|gl|>",
1735
+ "lstrip": false,
1736
+ "normalized": true,
1737
+ "rstrip": false,
1738
+ "single_word": false,
1739
+ "special": false
1740
+ },
1741
+ "50320": {
1742
+ "content": "<|mr|>",
1743
+ "lstrip": false,
1744
+ "normalized": true,
1745
+ "rstrip": false,
1746
+ "single_word": false,
1747
+ "special": false
1748
+ },
1749
+ "50321": {
1750
+ "content": "<|pa|>",
1751
+ "lstrip": false,
1752
+ "normalized": true,
1753
+ "rstrip": false,
1754
+ "single_word": false,
1755
+ "special": false
1756
+ },
1757
+ "50322": {
1758
+ "content": "<|si|>",
1759
+ "lstrip": false,
1760
+ "normalized": true,
1761
+ "rstrip": false,
1762
+ "single_word": false,
1763
+ "special": false
1764
+ },
1765
+ "50323": {
1766
+ "content": "<|km|>",
1767
+ "lstrip": false,
1768
+ "normalized": true,
1769
+ "rstrip": false,
1770
+ "single_word": false,
1771
+ "special": false
1772
+ },
1773
+ "50324": {
1774
+ "content": "<|sn|>",
1775
+ "lstrip": false,
1776
+ "normalized": true,
1777
+ "rstrip": false,
1778
+ "single_word": false,
1779
+ "special": false
1780
+ },
1781
+ "50325": {
1782
+ "content": "<|yo|>",
1783
+ "lstrip": false,
1784
+ "normalized": true,
1785
+ "rstrip": false,
1786
+ "single_word": false,
1787
+ "special": false
1788
+ },
1789
+ "50326": {
1790
+ "content": "<|so|>",
1791
+ "lstrip": false,
1792
+ "normalized": true,
1793
+ "rstrip": false,
1794
+ "single_word": false,
1795
+ "special": false
1796
+ },
1797
+ "50327": {
1798
+ "content": "<|af|>",
1799
+ "lstrip": false,
1800
+ "normalized": true,
1801
+ "rstrip": false,
1802
+ "single_word": false,
1803
+ "special": false
1804
+ },
1805
+ "50328": {
1806
+ "content": "<|oc|>",
1807
+ "lstrip": false,
1808
+ "normalized": true,
1809
+ "rstrip": false,
1810
+ "single_word": false,
1811
+ "special": false
1812
+ },
1813
+ "50329": {
1814
+ "content": "<|ka|>",
1815
+ "lstrip": false,
1816
+ "normalized": true,
1817
+ "rstrip": false,
1818
+ "single_word": false,
1819
+ "special": false
1820
+ },
1821
+ "50330": {
1822
+ "content": "<|be|>",
1823
+ "lstrip": false,
1824
+ "normalized": true,
1825
+ "rstrip": false,
1826
+ "single_word": false,
1827
+ "special": false
1828
+ },
1829
+ "50331": {
1830
+ "content": "<|tg|>",
1831
+ "lstrip": false,
1832
+ "normalized": true,
1833
+ "rstrip": false,
1834
+ "single_word": false,
1835
+ "special": false
1836
+ },
1837
+ "50332": {
1838
+ "content": "<|sd|>",
1839
+ "lstrip": false,
1840
+ "normalized": true,
1841
+ "rstrip": false,
1842
+ "single_word": false,
1843
+ "special": false
1844
+ },
1845
+ "50333": {
1846
+ "content": "<|gu|>",
1847
+ "lstrip": false,
1848
+ "normalized": true,
1849
+ "rstrip": false,
1850
+ "single_word": false,
1851
+ "special": false
1852
+ },
1853
+ "50334": {
1854
+ "content": "<|am|>",
1855
+ "lstrip": false,
1856
+ "normalized": true,
1857
+ "rstrip": false,
1858
+ "single_word": false,
1859
+ "special": false
1860
+ },
1861
+ "50335": {
1862
+ "content": "<|yi|>",
1863
+ "lstrip": false,
1864
+ "normalized": true,
1865
+ "rstrip": false,
1866
+ "single_word": false,
1867
+ "special": false
1868
+ },
1869
+ "50336": {
1870
+ "content": "<|lo|>",
1871
+ "lstrip": false,
1872
+ "normalized": true,
1873
+ "rstrip": false,
1874
+ "single_word": false,
1875
+ "special": false
1876
+ },
1877
+ "50337": {
1878
+ "content": "<|uz|>",
1879
+ "lstrip": false,
1880
+ "normalized": true,
1881
+ "rstrip": false,
1882
+ "single_word": false,
1883
+ "special": false
1884
+ },
1885
+ "50338": {
1886
+ "content": "<|fo|>",
1887
+ "lstrip": false,
1888
+ "normalized": true,
1889
+ "rstrip": false,
1890
+ "single_word": false,
1891
+ "special": false
1892
+ },
1893
+ "50339": {
1894
+ "content": "<|ht|>",
1895
+ "lstrip": false,
1896
+ "normalized": true,
1897
+ "rstrip": false,
1898
+ "single_word": false,
1899
+ "special": false
1900
+ },
1901
+ "50340": {
1902
+ "content": "<|ps|>",
1903
+ "lstrip": false,
1904
+ "normalized": true,
1905
+ "rstrip": false,
1906
+ "single_word": false,
1907
+ "special": false
1908
+ },
1909
+ "50341": {
1910
+ "content": "<|tk|>",
1911
+ "lstrip": false,
1912
+ "normalized": true,
1913
+ "rstrip": false,
1914
+ "single_word": false,
1915
+ "special": false
1916
+ },
1917
+ "50342": {
1918
+ "content": "<|nn|>",
1919
+ "lstrip": false,
1920
+ "normalized": true,
1921
+ "rstrip": false,
1922
+ "single_word": false,
1923
+ "special": false
1924
+ },
1925
+ "50343": {
1926
+ "content": "<|mt|>",
1927
+ "lstrip": false,
1928
+ "normalized": true,
1929
+ "rstrip": false,
1930
+ "single_word": false,
1931
+ "special": false
1932
+ },
1933
+ "50344": {
1934
+ "content": "<|sa|>",
1935
+ "lstrip": false,
1936
+ "normalized": true,
1937
+ "rstrip": false,
1938
+ "single_word": false,
1939
+ "special": false
1940
+ },
1941
+ "50345": {
1942
+ "content": "<|lb|>",
1943
+ "lstrip": false,
1944
+ "normalized": true,
1945
+ "rstrip": false,
1946
+ "single_word": false,
1947
+ "special": false
1948
+ },
1949
+ "50346": {
1950
+ "content": "<|my|>",
1951
+ "lstrip": false,
1952
+ "normalized": true,
1953
+ "rstrip": false,
1954
+ "single_word": false,
1955
+ "special": false
1956
+ },
1957
+ "50347": {
1958
+ "content": "<|bo|>",
1959
+ "lstrip": false,
1960
+ "normalized": true,
1961
+ "rstrip": false,
1962
+ "single_word": false,
1963
+ "special": false
1964
+ },
1965
+ "50348": {
1966
+ "content": "<|tl|>",
1967
+ "lstrip": false,
1968
+ "normalized": true,
1969
+ "rstrip": false,
1970
+ "single_word": false,
1971
+ "special": false
1972
+ },
1973
+ "50349": {
1974
+ "content": "<|mg|>",
1975
+ "lstrip": false,
1976
+ "normalized": true,
1977
+ "rstrip": false,
1978
+ "single_word": false,
1979
+ "special": false
1980
+ },
1981
+ "50350": {
1982
+ "content": "<|as|>",
1983
+ "lstrip": false,
1984
+ "normalized": true,
1985
+ "rstrip": false,
1986
+ "single_word": false,
1987
+ "special": false
1988
+ },
1989
+ "50351": {
1990
+ "content": "<|tt|>",
1991
+ "lstrip": false,
1992
+ "normalized": true,
1993
+ "rstrip": false,
1994
+ "single_word": false,
1995
+ "special": false
1996
+ },
1997
+ "50352": {
1998
+ "content": "<|haw|>",
1999
+ "lstrip": false,
2000
+ "normalized": true,
2001
+ "rstrip": false,
2002
+ "single_word": false,
2003
+ "special": false
2004
+ },
2005
+ "50353": {
2006
+ "content": "<|ln|>",
2007
+ "lstrip": false,
2008
+ "normalized": true,
2009
+ "rstrip": false,
2010
+ "single_word": false,
2011
+ "special": false
2012
+ },
2013
+ "50354": {
2014
+ "content": "<|ha|>",
2015
+ "lstrip": false,
2016
+ "normalized": true,
2017
+ "rstrip": false,
2018
+ "single_word": false,
2019
+ "special": false
2020
+ },
2021
+ "50355": {
2022
+ "content": "<|ba|>",
2023
+ "lstrip": false,
2024
+ "normalized": true,
2025
+ "rstrip": false,
2026
+ "single_word": false,
2027
+ "special": false
2028
+ },
2029
+ "50356": {
2030
+ "content": "<|jw|>",
2031
+ "lstrip": false,
2032
+ "normalized": true,
2033
+ "rstrip": false,
2034
+ "single_word": false,
2035
+ "special": false
2036
+ },
2037
+ "50357": {
2038
+ "content": "<|su|>",
2039
+ "lstrip": false,
2040
+ "normalized": true,
2041
+ "rstrip": false,
2042
+ "single_word": false,
2043
+ "special": false
2044
+ },
2045
+ "50358": {
2046
+ "content": "<|translate|>",
2047
+ "lstrip": false,
2048
+ "normalized": false,
2049
+ "rstrip": false,
2050
+ "single_word": false,
2051
+ "special": true
2052
+ },
2053
+ "50359": {
2054
+ "content": "<|transcribe|>",
2055
+ "lstrip": false,
2056
+ "normalized": false,
2057
+ "rstrip": false,
2058
+ "single_word": false,
2059
+ "special": true
2060
+ },
2061
+ "50360": {
2062
+ "content": "<|startoflm|>",
2063
+ "lstrip": false,
2064
+ "normalized": false,
2065
+ "rstrip": false,
2066
+ "single_word": false,
2067
+ "special": true
2068
+ },
2069
+ "50361": {
2070
+ "content": "<|startofprev|>",
2071
+ "lstrip": false,
2072
+ "normalized": false,
2073
+ "rstrip": false,
2074
+ "single_word": false,
2075
+ "special": true
2076
+ },
2077
+ "50362": {
2078
+ "content": "<|nocaptions|>",
2079
+ "lstrip": false,
2080
+ "normalized": false,
2081
+ "rstrip": false,
2082
+ "single_word": false,
2083
+ "special": true
2084
+ },
2085
+ "50363": {
2086
+ "content": "<|notimestamps|>",
2087
+ "lstrip": false,
2088
+ "normalized": false,
2089
+ "rstrip": false,
2090
+ "single_word": false,
2091
+ "special": true
2092
+ }
2093
+ },
2094
+ "additional_special_tokens": [
2095
+ "<|endoftext|>",
2096
+ "<|startoftranscript|>",
2097
+ "<|¡|>",
2098
+ "<|¢|>",
2099
+ "<|£|>",
2100
+ "<|¤|>",
2101
+ "<|¥|>",
2102
+ "<|¦|>",
2103
+ "<|§|>",
2104
+ "<|¨|>",
2105
+ "<|©|>",
2106
+ "<|ª|>",
2107
+ "<|«|>",
2108
+ "<|¬|>",
2109
+ "<|®|>",
2110
+ "<|¯|>",
2111
+ "<|°|>",
2112
+ "<|±|>",
2113
+ "<|²|>",
2114
+ "<|³|>",
2115
+ "<|´|>",
2116
+ "<|µ|>",
2117
+ "<|¶|>",
2118
+ "<|·|>",
2119
+ "<|¸|>",
2120
+ "<|¹|>",
2121
+ "<|º|>",
2122
+ "<|»|>",
2123
+ "<|¼|>",
2124
+ "<|½|>",
2125
+ "<|¾|>",
2126
+ "<|¿|>",
2127
+ "<|À|>",
2128
+ "<|Á|>",
2129
+ "<|Â|>",
2130
+ "<|Ã|>",
2131
+ "<|Ä|>",
2132
+ "<|Å|>",
2133
+ "<|Æ|>",
2134
+ "<|Ç|>",
2135
+ "<|È|>",
2136
+ "<|É|>",
2137
+ "<|Ê|>",
2138
+ "<|Ë|>",
2139
+ "<|Ì|>",
2140
+ "<|Í|>",
2141
+ "<|Î|>",
2142
+ "<|Ï|>",
2143
+ "<|Ð|>",
2144
+ "<|Ñ|>",
2145
+ "<|Ò|>",
2146
+ "<|Ó|>",
2147
+ "<|Ô|>",
2148
+ "<|Õ|>",
2149
+ "<|Ö|>",
2150
+ "<|×|>",
2151
+ "<|Ø|>",
2152
+ "<|Ù|>",
2153
+ "<|Ú|>",
2154
+ "<|Û|>",
2155
+ "<|Ü|>",
2156
+ "<|Ý|>",
2157
+ "<|Þ|>",
2158
+ "<|ß|>",
2159
+ "<|à|>",
2160
+ "<|á|>",
2161
+ "<|â|>",
2162
+ "<|ã|>",
2163
+ "<|ä|>",
2164
+ "<|å|>",
2165
+ "<|æ|>",
2166
+ "<|ç|>",
2167
+ "<|è|>",
2168
+ "<|é|>",
2169
+ "<|ê|>",
2170
+ "<|ë|>",
2171
+ "<|ì|>",
2172
+ "<|í|>",
2173
+ "<|î|>",
2174
+ "<|ï|>",
2175
+ "<|ð|>",
2176
+ "<|ñ|>",
2177
+ "<|ò|>",
2178
+ "<|ó|>",
2179
+ "<|ô|>",
2180
+ "<|õ|>",
2181
+ "<|ö|>",
2182
+ "<|÷|>",
2183
+ "<|ø|>",
2184
+ "<|ù|>",
2185
+ "<|ú|>",
2186
+ "<|û|>",
2187
+ "<|ü|>",
2188
+ "<|ý|>",
2189
+ "<|þ|>",
2190
+ "<|ÿ|>",
2191
+ "<|Ā|>",
2192
+ "<|ā|>",
2193
+ "<|Ă|>",
2194
+ "<|ă|>",
2195
+ "<|Ą|>",
2196
+ "<|ą|>",
2197
+ "<|Ć|>",
2198
+ "<|ć|>",
2199
+ "<|Ĉ|>",
2200
+ "<|ĉ|>",
2201
+ "<|Ċ|>",
2202
+ "<|ċ|>",
2203
+ "<|Č|>",
2204
+ "<|č|>",
2205
+ "<|Ď|>",
2206
+ "<|ď|>",
2207
+ "<|Đ|>",
2208
+ "<|đ|>",
2209
+ "<|Ē|>",
2210
+ "<|ē|>",
2211
+ "<|Ĕ|>",
2212
+ "<|ĕ|>",
2213
+ "<|Ė|>",
2214
+ "<|ė|>",
2215
+ "<|Ę|>",
2216
+ "<|ę|>",
2217
+ "<|Ě|>",
2218
+ "<|ě|>",
2219
+ "<|Ĝ|>",
2220
+ "<|ĝ|>",
2221
+ "<|Ğ|>",
2222
+ "<|ğ|>",
2223
+ "<|Ġ|>",
2224
+ "<|ġ|>",
2225
+ "<|Ģ|>",
2226
+ "<|ģ|>",
2227
+ "<|Ĥ|>",
2228
+ "<|ĥ|>",
2229
+ "<|Ħ|>",
2230
+ "<|ħ|>",
2231
+ "<|Ĩ|>",
2232
+ "<|ĩ|>",
2233
+ "<|Ī|>",
2234
+ "<|ī|>",
2235
+ "<|Ĭ|>",
2236
+ "<|ĭ|>",
2237
+ "<|Į|>",
2238
+ "<|į|>",
2239
+ "<|İ|>",
2240
+ "<|ı|>",
2241
+ "<|Ĳ|>",
2242
+ "<|ĳ|>",
2243
+ "<|Ĵ|>",
2244
+ "<|ĵ|>",
2245
+ "<|Ķ|>",
2246
+ "<|ķ|>",
2247
+ "<|ĸ|>",
2248
+ "<|Ĺ|>",
2249
+ "<|ĺ|>",
2250
+ "<|Ļ|>",
2251
+ "<|ļ|>",
2252
+ "<|Ľ|>",
2253
+ "<|ľ|>",
2254
+ "<|Ŀ|>",
2255
+ "<|ŀ|>",
2256
+ "<|Ł|>",
2257
+ "<|ł|>",
2258
+ "<|Ń|>",
2259
+ "<|Ûķ|>",
2260
+ "<|ÛĮ|>",
2261
+ "<|ا|>",
2262
+ "<|ĠØ|>",
2263
+ "<|ÙĪ|>",
2264
+ "<|ÙĨ|>",
2265
+ "<|ر|>",
2266
+ "<|Ú©|>",
2267
+ "<|ĠÙ|>",
2268
+ "<|ت|>",
2269
+ "<|اÙĨ|>",
2270
+ "<|Ûİ|>",
2271
+ "<|د|>",
2272
+ "<|Ùħ|>",
2273
+ "<|Ġب|>",
2274
+ "<|ÛĨ|>",
2275
+ "<|س|>",
2276
+ "<|translate|>",
2277
+ "<|transcribe|>",
2278
+ "<|startoflm|>",
2279
+ "<|startofprev|>",
2280
+ "<|nocaptions|>",
2281
+ "<|notimestamps|>"
2282
+ ],
2283
+ "bos_token": "<|endoftext|>",
2284
+ "clean_up_tokenization_spaces": true,
2285
+ "cls_token": "[CLS]",
2286
+ "do_basic_tokenize": true,
2287
+ "do_lower_case": true,
2288
+ "eos_token": "<|endoftext|>",
2289
+ "errors": "replace",
2290
+ "mask_token": "[MASK]",
2291
+ "model_max_length": 12,
2292
+ "never_split": null,
2293
+ "pad_token": "<|endoftext|>",
2294
+ "processor_class": "WhisperProcessor",
2295
+ "return_attention_mask": false,
2296
+ "sep_token": "[SEP]",
2297
+ "strip_accents": null,
2298
+ "tokenize_chinese_chars": true,
2299
+ "tokenizer_class": "BertTokenizer",
2300
+ "unk_token": "<|endoftext|>"
2301
+ }
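Note on reading this config: each `added_tokens_decoder` entry carries a `special` flag, and in the entries shown above the language tags (for example `<|su|>`) are marked `false` while the task tokens (for example `<|translate|>`) are marked `true`. The following is a minimal sketch, using only the Python standard library, of how one might separate the two groups. The `SAMPLE` string is a hypothetical two-entry excerpt shaped like the file above, and `split_added_tokens` is an illustrative helper name, not part of any library.

```python
import json

# Hypothetical excerpt shaped like the tokenizer_config.json above; only two
# of the added_tokens_decoder entries are reproduced here.
SAMPLE = """
{
  "added_tokens_decoder": {
    "50357": {"content": "<|su|>", "special": false},
    "50358": {"content": "<|translate|>", "special": true}
  },
  "additional_special_tokens": ["<|translate|>"],
  "tokenizer_class": "BertTokenizer",
  "processor_class": "WhisperProcessor",
  "model_max_length": 12
}
"""


def split_added_tokens(config):
    """Return (special, regular) lists of (id, content) pairs, sorted by id."""
    special, regular = [], []
    entries = sorted(
        config["added_tokens_decoder"].items(), key=lambda kv: int(kv[0])
    )
    for token_id, entry in entries:
        pair = (int(token_id), entry["content"])
        # Entries without a "special" key are treated as regular tokens.
        (special if entry.get("special", False) else regular).append(pair)
    return special, regular


config = json.loads(SAMPLE)
special, regular = split_added_tokens(config)
print("special:", special)
print("regular:", regular)
```

To run the same check against the real file, one would replace `json.loads(SAMPLE)` with a `json.load` call on the downloaded `tokenizer_config.json`; the path is left to the reader because it depends on where the repository is checked out.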
bert/vocab.txt ADDED