Commit fc20120 (verified), committed by gsaltintas
1 Parent(s): a06a577

Upload folder using huggingface_hub

Files changed (6)
  1. README.md +54 -0
  2. merges.txt +102 -0
  3. special_tokens_map.json +5 -0
  4. tokenizer.json +849 -0
  5. tokenizer_config.json +37 -0
  6. vocab.json +362 -0
README.md ADDED
@@ -0,0 +1,54 @@
+ ---
+ license: mit
+ language:
+ - eng
+
+ tags:
+ - tokenizer
+ - bpe
+ - flexitok
+ - fineweb2
+ ---
+
+ # Byte-Level BPE Tokenizer: ['eng_Latn'] (0K)
+
+ A **Byte-Level BPE** tokenizer trained on **['eng_Latn']** data from Fineweb-2-HQ.
+
+ ## Training Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Algorithm | Byte-Level BPE |
+ | Language | `['eng_Latn']` |
+ | Target Vocab Size | 360 |
+ | Final Vocab Size | 360 |
+ | Pre-tokenizer | custom:addition |
+ | Number handling | ltr_3digit |
+ | Contraction handling | False |
+ | Normalizer | NFC |
+ | Special Tokens | `<s>`, `</s>`, `<pad>`, `<unk>` |
+ | Training Shards | 2 |
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("flexitok/maddition_eng_Latn_360")
+ tokens = tokenizer.encode("Hello, world!")
+ ```
+
+ ## Files
+
+ - `tokenizer.json` — Full HuggingFace tokenizer
+ - `vocab.json` — Vocabulary mapping
+ - `merges.txt` — BPE merge rules
+
+ ## Sample Encoding
+ | Text | Tokens | Token IDs |
+ |------|--------|-----------|
+ | `yirmi iki+dokuz=otuz bir\ntwenty two+nine=thirty one` | `y, i, r, m, i, Ġ, i, k, i, +, d, o, k, u, z, =, o, t, u, z` | `91, 75, 84, 79, 75, 223, 75, 77, 75, 13, 70, 81, 77, 87, 92, 31, 81, 86, 87, 92` |
+
+ Command used to create this tokenizer:
+ ```bash
+ ['/home/gsa/tokenizers2/flexitok/tokenizer_training/train_tokenizers.py', 'algorithm=bpe', 'vocab_size=360', 'langs=[eng_Latn]', 'data_dir=/scratch/gsa/data/multilingual-addition/', 'output_dir=/scratch/gsa/trained_tokenizers/multilingual_addition', 'pretokenizer=custom:addition', 'number_handling=ltr_3digit', 'add_numbers=false', 'handle_contractions=false', 'unicode_normalization=nfc', 'use_byte_level_regex=false', 'byte_fallback=false', 'strip_zero_width=false', 'cjk_char_split=false', 'add_cjk_chars=false', 'max_lines=-1', 'test_string=yirmi iki+dokuz=otuz bir\\ntwenty two+nine=thirty one', 'hf.publish_to_hf=true', 'hf_repo_prefix=flexitok/', 'hf.hf_repo_id=flexitok/maddition_eng_Latn_360', 'hf.collections=[flexitok/multilingual_addition_tokenizers]']
merges.txt ADDED
@@ -0,0 +1,102 @@
+ #version: 0.2
+ ['n', 'd']
+ ['r', 'e']
+ ['a', 'nd']
+ ['h', 'u']
+ ['nd', 're']
+ ['hu', 'ndre']
+ ['hundre', 'd']
+ ['t', 'y']
+ ['ty', '-']
+ ['n', 'e']
+ ['v', 'e']
+ ['t', 'h']
+ ['o', 'u']
+ ['o', 'ne']
+ ['e', 've']
+ ['eve', 'n']
+ ['e', 'i']
+ ['f', 'i']
+ ['g', 'h']
+ ['i', 'x']
+ ['i', 'ne']
+ ['n', 'ine']
+ ['s', 'even']
+ ['s', 'ix']
+ ['t', 'w']
+ ['ei', 'gh']
+ ['f', 'ou']
+ ['eigh', 't']
+ ['fou', 'r']
+ ['re', 'e']
+ ['th', 'ree']
+ ['fi', 've']
+ ['tw', 'o']
+ ['e', 'n']
+ ['r', 'ty-']
+ ['s', 'and']
+ ['th', 'ou']
+ ['thou', 'sand']
+ ['thousand', ',']
+ ['th', 'i']
+ ['fi', 'f']
+ ['f', 'o']
+ ['tw', 'en']
+ ['nine', 'ty-']
+ ['seven', 'ty-']
+ ['six', 'ty-']
+ ['eigh', 'ty-']
+ ['thi', 'rty-']
+ ['fif', 'ty-']
+ ['fo', 'rty-']
+ ['twen', 'ty-']
+ ['e', 'en']
+ ['t', 'een']
+ ['e', 'l']
+ ['r', 'ty']
+ ['r', 'teen']
+ ['t', 'en']
+ ['nine', 'ty']
+ ['nine', 'teen']
+ ['seven', 'ty']
+ ['seven', 'teen']
+ ['six', 'ty']
+ ['six', 'teen']
+ ['tw', 'el']
+ ['eigh', 'ty']
+ ['eight', 'een']
+ ['four', 'teen']
+ ['thi', 'rty']
+ ['thi', 'rteen']
+ ['fif', 'ty']
+ ['fif', 'teen']
+ ['fo', 'rty']
+ ['twen', 'ty']
+ ['ninety-', 'one']
+ ['ninety-', 'nine']
+ ['ninety-', 'seven']
+ ['ninety-', 'six']
+ ['ninety-', 'eight']
+ ['ninety-', 'four']
+ ['ninety-', 'three']
+ ['ninety-', 'five']
+ ['ninety-', 'two']
+ ['seventy-', 'one']
+ ['seventy-', 'nine']
+ ['seventy-', 'seven']
+ ['seventy-', 'six']
+ ['seventy-', 'eight']
+ ['seventy-', 'four']
+ ['seventy-', 'three']
+ ['seventy-', 'five']
+ ['seventy-', 'two']
+ ['sixty-', 'one']
+ ['sixty-', 'nine']
+ ['sixty-', 'seven']
+ ['sixty-', 'six']
+ ['sixty-', 'eight']
+ ['sixty-', 'four']
+ ['sixty-', 'three']
+ ['sixty-', 'five']
+ ['sixty-', 'two']
+ ['eighty-', 'one']
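
Reviewer note: the merge rules in this file are ranked by position, with earlier lines taking priority. The lines here look like Python list literals rather than the space-separated pairs that `merges.txt` files normally contain, so the sketch below parses them with `ast.literal_eval`. It is a simplified illustration that uses only the first seven rules and merges one pair at a time; `parse_merges` and `apply_bpe` are hypothetical helper names, not functions from any library.

```python
import ast

# First seven rules copied from the merges.txt diff, plus its header line.
MERGE_LINES = """\
#version: 0.2
['n', 'd']
['r', 'e']
['a', 'nd']
['h', 'u']
['nd', 're']
['hu', 'ndre']
['hundre', 'd']
""".splitlines()


def parse_merges(lines):
    """Map each (left, right) pair to its rank; lower rank means higher priority."""
    ranks = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the version header
        left, right = ast.literal_eval(line)
        ranks[(left, right)] = len(ranks)
    return ranks


def apply_bpe(word, ranks):
    """Repeatedly merge the adjacent pair with the lowest rank until none applies."""
    symbols = list(word)
    while len(symbols) > 1:
        candidates = [
            (ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no known merge applies to any adjacent pair
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


ranks = parse_merges(MERGE_LINES)
print(apply_bpe("hundred", ranks))
print(apply_bpe("red", ranks))
```

With the full rule list the same loop would also cover the number words that appear later in the file; the seven-rule subset is only there to keep the example short.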
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "bos_token": "<s>",
+   "eos_token": "</s>",
+   "pad_token": "<pad>"
+ }
tokenizer.json ADDED
@@ -0,0 +1,849 @@
1
+ {
2
+ "version": "1.0",
3
+ "truncation": null,
4
+ "padding": null,
5
+ "added_tokens": [
6
+ {
7
+ "id": 0,
8
+ "content": "<s>",
9
+ "single_word": false,
10
+ "lstrip": false,
11
+ "rstrip": false,
12
+ "normalized": false,
13
+ "special": true
14
+ },
15
+ {
16
+ "id": 1,
17
+ "content": "</s>",
18
+ "single_word": false,
19
+ "lstrip": false,
20
+ "rstrip": false,
21
+ "normalized": false,
22
+ "special": true
23
+ },
24
+ {
25
+ "id": 2,
26
+ "content": "<pad>",
27
+ "single_word": false,
28
+ "lstrip": false,
29
+ "rstrip": false,
30
+ "normalized": false,
31
+ "special": true
32
+ }
33
+ ],
34
+ "normalizer": {
35
+ "type": "NFC"
36
+ },
37
+ "pre_tokenizer": {
38
+ "type": "Sequence",
39
+ "pretokenizers": [
40
+ {
41
+ "type": "Split",
42
+ "pattern": {
43
+ "Regex": "[+=]|[^\\S\\r\\n]*[\\n\\r]+|[^\\S\\r\\n]+"
44
+ },
45
+ "behavior": "Isolated",
46
+ "invert": false
47
+ },
48
+ {
49
+ "type": "Split",
50
+ "pattern": {
51
+ "Regex": "\\p{N}{1,3}"
52
+ },
53
+ "behavior": "Isolated",
54
+ "invert": false
55
+ },
56
+ {
57
+ "type": "ByteLevel",
58
+ "add_prefix_space": false,
59
+ "trim_offsets": true,
60
+ "use_regex": false
61
+ }
62
+ ]
63
+ },
64
+ "post_processor": null,
65
+ "decoder": {
66
+ "type": "ByteLevel",
67
+ "add_prefix_space": true,
68
+ "trim_offsets": true,
69
+ "use_regex": true
70
+ },
71
+ "model": {
72
+ "type": "BPE",
73
+ "dropout": null,
74
+ "unk_token": null,
75
+ "continuing_subword_prefix": null,
76
+ "end_of_word_suffix": null,
77
+ "fuse_unk": false,
78
+ "byte_fallback": false,
79
+ "ignore_merges": false,
80
+ "vocab": {
81
+ "<s>": 0,
82
+ "</s>": 1,
83
+ "<pad>": 2,
84
+ "!": 3,
85
+ "\"": 4,
86
+ "#": 5,
87
+ "$": 6,
88
+ "%": 7,
89
+ "&": 8,
90
+ "'": 9,
91
+ "(": 10,
92
+ ")": 11,
93
+ "*": 12,
94
+ "+": 13,
95
+ ",": 14,
96
+ "-": 15,
97
+ ".": 16,
98
+ "/": 17,
99
+ "0": 18,
100
+ "1": 19,
101
+ "2": 20,
102
+ "3": 21,
103
+ "4": 22,
104
+ "5": 23,
105
+ "6": 24,
106
+ "7": 25,
107
+ "8": 26,
108
+ "9": 27,
109
+ ":": 28,
110
+ ";": 29,
111
+ "<": 30,
112
+ "=": 31,
113
+ ">": 32,
114
+ "?": 33,
115
+ "@": 34,
116
+ "A": 35,
117
+ "B": 36,
118
+ "C": 37,
119
+ "D": 38,
120
+ "E": 39,
121
+ "F": 40,
122
+ "G": 41,
123
+ "H": 42,
124
+ "I": 43,
125
+ "J": 44,
126
+ "K": 45,
127
+ "L": 46,
128
+ "M": 47,
129
+ "N": 48,
130
+ "O": 49,
131
+ "P": 50,
132
+ "Q": 51,
133
+ "R": 52,
134
+ "S": 53,
135
+ "T": 54,
136
+ "U": 55,
137
+ "V": 56,
138
+ "W": 57,
139
+ "X": 58,
140
+ "Y": 59,
141
+ "Z": 60,
142
+ "[": 61,
143
+ "\\": 62,
144
+ "]": 63,
145
+ "^": 64,
146
+ "_": 65,
147
+ "`": 66,
148
+ "a": 67,
149
+ "b": 68,
150
+ "c": 69,
151
+ "d": 70,
152
+ "e": 71,
153
+ "f": 72,
154
+ "g": 73,
155
+ "h": 74,
156
+ "i": 75,
157
+ "j": 76,
158
+ "k": 77,
159
+ "l": 78,
160
+ "m": 79,
161
+ "n": 80,
162
+ "o": 81,
163
+ "p": 82,
164
+ "q": 83,
165
+ "r": 84,
166
+ "s": 85,
167
+ "t": 86,
168
+ "u": 87,
169
+ "v": 88,
170
+ "w": 89,
171
+ "x": 90,
172
+ "y": 91,
173
+ "z": 92,
174
+ "{": 93,
175
+ "|": 94,
176
+ "}": 95,
177
+ "~": 96,
178
+ "¡": 97,
179
+ "¢": 98,
180
+ "£": 99,
181
+ "¤": 100,
182
+ "¥": 101,
183
+ "¦": 102,
184
+ "§": 103,
185
+ "¨": 104,
186
+ "©": 105,
187
+ "ª": 106,
188
+ "«": 107,
189
+ "¬": 108,
190
+ "®": 109,
191
+ "¯": 110,
192
+ "°": 111,
193
+ "±": 112,
194
+ "²": 113,
195
+ "³": 114,
196
+ "´": 115,
197
+ "µ": 116,
198
+ "¶": 117,
199
+ "·": 118,
200
+ "¸": 119,
201
+ "¹": 120,
202
+ "º": 121,
203
+ "»": 122,
204
+ "¼": 123,
205
+ "½": 124,
206
+ "¾": 125,
207
+ "¿": 126,
208
+ "À": 127,
209
+ "Á": 128,
210
+ "Â": 129,
211
+ "Ã": 130,
212
+ "Ä": 131,
213
+ "Å": 132,
214
+ "Æ": 133,
215
+ "Ç": 134,
216
+ "È": 135,
217
+ "É": 136,
218
+ "Ê": 137,
219
+ "Ë": 138,
220
+ "Ì": 139,
221
+ "Í": 140,
222
+ "Î": 141,
223
+ "Ï": 142,
224
+ "Ð": 143,
225
+ "Ñ": 144,
226
+ "Ò": 145,
227
+ "Ó": 146,
228
+ "Ô": 147,
229
+ "Õ": 148,
230
+ "Ö": 149,
231
+ "×": 150,
232
+ "Ø": 151,
233
+ "Ù": 152,
234
+ "Ú": 153,
235
+ "Û": 154,
236
+ "Ü": 155,
237
+ "Ý": 156,
238
+ "Þ": 157,
239
+ "ß": 158,
240
+ "à": 159,
241
+ "á": 160,
242
+ "â": 161,
243
+ "ã": 162,
244
+ "ä": 163,
245
+ "å": 164,
246
+ "æ": 165,
247
+ "ç": 166,
248
+ "è": 167,
249
+ "é": 168,
250
+ "ê": 169,
251
+ "ë": 170,
252
+ "ì": 171,
253
+ "í": 172,
254
+ "î": 173,
255
+ "ï": 174,
256
+ "ð": 175,
257
+ "ñ": 176,
258
+ "ò": 177,
259
+ "ó": 178,
260
+ "ô": 179,
261
+ "õ": 180,
262
+ "ö": 181,
263
+ "÷": 182,
264
+ "ø": 183,
265
+ "ù": 184,
266
+ "ú": 185,
267
+ "û": 186,
268
+ "ü": 187,
269
+ "ý": 188,
270
+ "þ": 189,
271
+ "ÿ": 190,
272
+ "Ā": 191,
273
+ "ā": 192,
274
+ "Ă": 193,
275
+ "ă": 194,
276
+ "Ą": 195,
277
+ "ą": 196,
278
+ "Ć": 197,
279
+ "ć": 198,
280
+ "Ĉ": 199,
281
+ "ĉ": 200,
282
+ "Ċ": 201,
283
+ "ċ": 202,
284
+ "Č": 203,
285
+ "č": 204,
286
+ "Ď": 205,
287
+ "ď": 206,
288
+ "Đ": 207,
289
+ "đ": 208,
290
+ "Ē": 209,
291
+ "ē": 210,
292
+ "Ĕ": 211,
293
+ "ĕ": 212,
294
+ "Ė": 213,
295
+ "ė": 214,
296
+ "Ę": 215,
297
+ "ę": 216,
298
+ "Ě": 217,
299
+ "ě": 218,
300
+ "Ĝ": 219,
301
+ "ĝ": 220,
302
+ "Ğ": 221,
303
+ "ğ": 222,
304
+ "Ġ": 223,
305
+ "ġ": 224,
306
+ "Ģ": 225,
307
+ "ģ": 226,
308
+ "Ĥ": 227,
309
+ "ĥ": 228,
310
+ "Ħ": 229,
311
+ "ħ": 230,
312
+ "Ĩ": 231,
313
+ "ĩ": 232,
314
+ "Ī": 233,
315
+ "ī": 234,
316
+ "Ĭ": 235,
317
+ "ĭ": 236,
318
+ "Į": 237,
319
+ "į": 238,
320
+ "İ": 239,
321
+ "ı": 240,
322
+ "IJ": 241,
323
+ "ij": 242,
324
+ "Ĵ": 243,
325
+ "ĵ": 244,
326
+ "Ķ": 245,
327
+ "ķ": 246,
328
+ "ĸ": 247,
329
+ "Ĺ": 248,
330
+ "ĺ": 249,
331
+ "Ļ": 250,
332
+ "ļ": 251,
333
+ "Ľ": 252,
334
+ "ľ": 253,
335
+ "Ŀ": 254,
336
+ "ŀ": 255,
337
+ "Ł": 256,
338
+ "ł": 257,
339
+ "Ń": 258,
340
+ "nd": 259,
341
+ "re": 260,
342
+ "and": 261,
343
+ "hu": 262,
344
+ "ndre": 263,
345
+ "hundre": 264,
346
+ "hundred": 265,
347
+ "ty": 266,
348
+ "ty-": 267,
349
+ "ne": 268,
350
+ "ve": 269,
351
+ "th": 270,
352
+ "ou": 271,
353
+ "one": 272,
354
+ "eve": 273,
355
+ "even": 274,
356
+ "ei": 275,
357
+ "fi": 276,
358
+ "gh": 277,
359
+ "ix": 278,
360
+ "ine": 279,
361
+ "nine": 280,
362
+ "seven": 281,
363
+ "six": 282,
364
+ "tw": 283,
365
+ "eigh": 284,
366
+ "fou": 285,
367
+ "eight": 286,
368
+ "four": 287,
369
+ "ree": 288,
370
+ "three": 289,
371
+ "five": 290,
372
+ "two": 291,
373
+ "en": 292,
374
+ "rty-": 293,
375
+ "sand": 294,
376
+ "thou": 295,
377
+ "thousand": 296,
378
+ "thousand,": 297,
379
+ "thi": 298,
380
+ "fif": 299,
381
+ "fo": 300,
382
+ "twen": 301,
383
+ "ninety-": 302,
384
+ "seventy-": 303,
385
+ "sixty-": 304,
386
+ "eighty-": 305,
387
+ "thirty-": 306,
388
+ "fifty-": 307,
389
+ "forty-": 308,
390
+ "twenty-": 309,
391
+ "een": 310,
392
+ "teen": 311,
393
+ "el": 312,
394
+ "rty": 313,
395
+ "rteen": 314,
396
+ "ten": 315,
397
+ "ninety": 316,
398
+ "nineteen": 317,
399
+ "seventy": 318,
400
+ "seventeen": 319,
401
+ "sixty": 320,
402
+ "sixteen": 321,
403
+ "twel": 322,
404
+ "eighty": 323,
405
+ "eighteen": 324,
406
+ "fourteen": 325,
407
+ "thirty": 326,
408
+ "thirteen": 327,
409
+ "fifty": 328,
410
+ "fifteen": 329,
411
+ "forty": 330,
412
+ "twenty": 331,
413
+ "ninety-one": 332,
414
+ "ninety-nine": 333,
415
+ "ninety-seven": 334,
416
+ "ninety-six": 335,
417
+ "ninety-eight": 336,
418
+ "ninety-four": 337,
419
+ "ninety-three": 338,
420
+ "ninety-five": 339,
421
+ "ninety-two": 340,
422
+ "seventy-one": 341,
423
+ "seventy-nine": 342,
424
+ "seventy-seven": 343,
425
+ "seventy-six": 344,
426
+ "seventy-eight": 345,
427
+ "seventy-four": 346,
428
+ "seventy-three": 347,
429
+ "seventy-five": 348,
430
+ "seventy-two": 349,
431
+ "sixty-one": 350,
432
+ "sixty-nine": 351,
433
+ "sixty-seven": 352,
434
+ "sixty-six": 353,
435
+ "sixty-eight": 354,
436
+ "sixty-four": 355,
437
+ "sixty-three": 356,
438
+ "sixty-five": 357,
439
+ "sixty-two": 358,
440
+ "eighty-one": 359
441
+ },
442
+ "merges": [
443
+ [
444
+ "n",
445
+ "d"
446
+ ],
447
+ [
448
+ "r",
449
+ "e"
450
+ ],
451
+ [
452
+ "a",
453
+ "nd"
454
+ ],
455
+ [
456
+ "h",
457
+ "u"
458
+ ],
459
+ [
460
+ "nd",
461
+ "re"
462
+ ],
463
+ [
464
+ "hu",
465
+ "ndre"
466
+ ],
467
+ [
468
+ "hundre",
469
+ "d"
470
+ ],
471
+ [
472
+ "t",
473
+ "y"
474
+ ],
475
+ [
476
+ "ty",
477
+ "-"
478
+ ],
479
+ [
480
+ "n",
481
+ "e"
482
+ ],
483
+ [
484
+ "v",
485
+ "e"
486
+ ],
487
+ [
488
+ "t",
489
+ "h"
490
+ ],
491
+ [
492
+ "o",
493
+ "u"
494
+ ],
495
+ [
496
+ "o",
497
+ "ne"
498
+ ],
499
+ [
500
+ "e",
501
+ "ve"
502
+ ],
503
+ [
504
+ "eve",
505
+ "n"
506
+ ],
507
+ [
508
+ "e",
509
+ "i"
510
+ ],
511
+ [
512
+ "f",
513
+ "i"
514
+ ],
515
+ [
516
+ "g",
517
+ "h"
518
+ ],
519
+ [
520
+ "i",
521
+ "x"
522
+ ],
523
+ [
524
+ "i",
525
+ "ne"
526
+ ],
527
+ [
528
+ "n",
529
+ "ine"
530
+ ],
531
+ [
532
+ "s",
533
+ "even"
534
+ ],
535
+ [
536
+ "s",
537
+ "ix"
538
+ ],
539
+ [
540
+ "t",
541
+ "w"
542
+ ],
543
+ [
544
+ "ei",
545
+ "gh"
546
+ ],
547
+ [
548
+ "f",
549
+ "ou"
550
+ ],
551
+ [
552
+ "eigh",
553
+ "t"
554
+ ],
555
+ [
556
+ "fou",
557
+ "r"
558
+ ],
559
+ [
560
+ "re",
561
+ "e"
562
+ ],
563
+ [
564
+ "th",
565
+ "ree"
566
+ ],
567
+ [
568
+ "fi",
569
+ "ve"
570
+ ],
571
+ [
572
+ "tw",
573
+ "o"
574
+ ],
575
+ [
576
+ "e",
577
+ "n"
578
+ ],
579
+ [
580
+ "r",
581
+ "ty-"
582
+ ],
583
+ [
584
+ "s",
585
+ "and"
586
+ ],
587
+ [
588
+ "th",
589
+ "ou"
590
+ ],
591
+ [
592
+ "thou",
593
+ "sand"
594
+ ],
595
+ [
596
+ "thousand",
597
+ ","
598
+ ],
599
+ [
600
+ "th",
601
+ "i"
602
+ ],
603
+ [
604
+ "fi",
605
+ "f"
606
+ ],
607
+ [
608
+ "f",
609
+ "o"
610
+ ],
611
+ [
612
+ "tw",
613
+ "en"
614
+ ],
615
+ [
616
+ "nine",
617
+ "ty-"
618
+ ],
619
+ [
620
+ "seven",
621
+ "ty-"
622
+ ],
623
+ [
624
+ "six",
625
+ "ty-"
626
+ ],
627
+ [
628
+ "eigh",
629
+ "ty-"
630
+ ],
631
+ [
632
+ "thi",
633
+ "rty-"
634
+ ],
635
+ [
636
+ "fif",
637
+ "ty-"
638
+ ],
639
+ [
640
+ "fo",
641
+ "rty-"
642
+ ],
643
+ [
644
+ "twen",
645
+ "ty-"
646
+ ],
647
+ [
648
+ "e",
649
+ "en"
650
+ ],
651
+ [
652
+ "t",
653
+ "een"
654
+ ],
655
+ [
656
+ "e",
657
+ "l"
658
+ ],
659
+ [
660
+ "r",
661
+ "ty"
662
+ ],
663
+ [
664
+ "r",
665
+ "teen"
666
+ ],
667
+ [
668
+ "t",
669
+ "en"
670
+ ],
671
+ [
672
+ "nine",
673
+ "ty"
674
+ ],
675
+ [
676
+ "nine",
677
+ "teen"
678
+ ],
679
+ [
680
+ "seven",
681
+ "ty"
682
+ ],
683
+ [
684
+ "seven",
685
+ "teen"
686
+ ],
687
+ [
688
+ "six",
689
+ "ty"
690
+ ],
691
+ [
692
+ "six",
693
+ "teen"
694
+ ],
695
+ [
696
+ "tw",
697
+ "el"
698
+ ],
699
+ [
700
+ "eigh",
701
+ "ty"
702
+ ],
703
+ [
704
+ "eight",
705
+ "een"
706
+ ],
707
+ [
708
+ "four",
709
+ "teen"
710
+ ],
711
+ [
712
+ "thi",
713
+ "rty"
714
+ ],
715
+ [
716
+ "thi",
717
+ "rteen"
718
+ ],
719
+ [
720
+ "fif",
721
+ "ty"
722
+ ],
723
+ [
724
+ "fif",
725
+ "teen"
726
+ ],
727
+ [
728
+ "fo",
729
+ "rty"
730
+ ],
731
+ [
732
+ "twen",
733
+ "ty"
734
+ ],
735
+ [
736
+ "ninety-",
737
+ "one"
738
+ ],
739
+ [
740
+ "ninety-",
741
+ "nine"
742
+ ],
743
+ [
744
+ "ninety-",
745
+ "seven"
746
+ ],
747
+ [
748
+ "ninety-",
749
+ "six"
750
+ ],
751
+ [
752
+ "ninety-",
753
+ "eight"
754
+ ],
755
+ [
756
+ "ninety-",
757
+ "four"
758
+ ],
759
+ [
760
+ "ninety-",
761
+ "three"
762
+ ],
763
+ [
764
+ "ninety-",
765
+ "five"
766
+ ],
767
+ [
768
+ "ninety-",
769
+ "two"
770
+ ],
771
+ [
772
+ "seventy-",
773
+ "one"
774
+ ],
775
+ [
776
+ "seventy-",
777
+ "nine"
778
+ ],
779
+ [
780
+ "seventy-",
781
+ "seven"
782
+ ],
783
+ [
784
+ "seventy-",
785
+ "six"
786
+ ],
787
+ [
788
+ "seventy-",
789
+ "eight"
790
+ ],
791
+ [
792
+ "seventy-",
793
+ "four"
794
+ ],
795
+ [
796
+ "seventy-",
797
+ "three"
798
+ ],
799
+ [
800
+ "seventy-",
801
+ "five"
802
+ ],
803
+ [
804
+ "seventy-",
805
+ "two"
806
+ ],
807
+ [
808
+ "sixty-",
809
+ "one"
810
+ ],
811
+ [
812
+ "sixty-",
813
+ "nine"
814
+ ],
815
+ [
816
+ "sixty-",
817
+ "seven"
818
+ ],
819
+ [
820
+ "sixty-",
821
+ "six"
822
+ ],
823
+ [
824
+ "sixty-",
825
+ "eight"
826
+ ],
827
+ [
828
+ "sixty-",
829
+ "four"
830
+ ],
831
+ [
832
+ "sixty-",
833
+ "three"
834
+ ],
835
+ [
836
+ "sixty-",
837
+ "five"
838
+ ],
839
+ [
840
+ "sixty-",
841
+ "two"
842
+ ],
843
+ [
844
+ "eighty-",
845
+ "one"
846
+ ]
847
+ ]
848
+ }
849
+ }
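
Reviewer note: the `pre_tokenizer` section of `tokenizer.json` chains two `Split` steps with `Isolated` behavior before the byte-level mapping. The following is a rough standard-library approximation of those two steps, not the `tokenizers` implementation. It assumes `\d` as a stand-in for `\p{N}`, because Python's `re` does not support `\p{...}` classes and `\d` covers decimal digits only. `split_isolated` and `pretokenize` are hypothetical names.

```python
import re

# First Split pattern from the pre_tokenizer config (JSON escaping removed).
OPERATORS_AND_SPACE = r"[+=]|[^\S\r\n]*[\n\r]+|[^\S\r\n]+"
# Assumption: \d{1,3} approximates \p{N}{1,3}; \d is narrower than \p{N}.
DIGIT_GROUPS = r"\d{1,3}"


def split_isolated(pattern, text):
    """Emit each match as its own piece and keep the text between matches."""
    pieces, pos = [], 0
    for match in re.finditer(pattern, text):
        if match.start() > pos:
            pieces.append(text[pos:match.start()])
        if match.group():
            pieces.append(match.group())
        pos = match.end()
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces


def pretokenize(text):
    """Apply the two Split steps in order, as in the Sequence config."""
    pieces = split_isolated(OPERATORS_AND_SPACE, text)
    return [part for piece in pieces for part in split_isolated(DIGIT_GROUPS, piece)]


print(pretokenize("twenty two+nine=thirty one"))
print(pretokenize("12345+67"))
```

The second call shows digits being grouped from the left in chunks of up to three, which is consistent with the `ltr_3digit` setting recorded in the README and in `tokenizer_config.json`.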
tokenizer_config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<pad>",
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "unk_token": null,
+   "number_handling": "ltr_3digit"
+ }
vocab.json ADDED
@@ -0,0 +1,362 @@
1
+ {
2
+ "ty": 266,
3
+ "sixteen": 321,
4
+ "ninety-two": 340,
5
+ "Û": 154,
6
+ "3": 21,
7
+ "twel": 322,
8
+ "<s>": 0,
9
+ "B": 36,
10
+ "-": 15,
11
+ "ĵ": 244,
12
+ "Ç": 134,
13
+ "i": 75,
14
+ "(": 10,
15
+ "Î": 141,
16
+ "el": 312,
17
+ "û": 186,
18
+ "2": 20,
19
+ "¹": 120,
20
+ "Į": 237,
21
+ "hundre": 264,
22
+ "ý": 188,
23
+ "P": 50,
24
+ "Ø": 151,
25
+ "R": 52,
26
+ "ou": 271,
27
+ "9": 27,
28
+ "Ê": 137,
29
+ "Ĵ": 243,
30
+ "&": 8,
31
+ "fifty-": 307,
32
+ "s": 85,
33
+ "ę": 216,
34
+ "¨": 104,
35
+ "ve": 269,
36
+ "¢": 98,
37
+ "seven": 281,
38
+ "ndre": 263,
39
+ "l": 78,
40
+ "Đ": 207,
41
+ "Č": 203,
42
+ "ĥ": 228,
43
+ "eighty": 323,
44
+ "Ġ": 223,
45
+ "Õ": 148,
46
+ "È": 135,
47
+ "¦": 102,
48
+ "?": 33,
49
+ "Ľ": 252,
50
+ "ć": 198,
51
+ "E": 39,
52
+ "ċ": 202,
53
+ "í": 172,
54
+ "¼": 123,
55
+ "seventy-two": 349,
56
+ "£": 99,
57
+ "³": 114,
58
+ "{": 93,
59
+ "¡": 97,
60
+ "ò": 177,
61
+ "ij": 242,
62
+ "á": 160,
63
+ "G": 41,
64
+ "ù": 184,
65
+ "¶": 117,
66
+ "fou": 285,
67
+ "}": 95,
68
+ "v": 88,
69
+ "č": 204,
70
+ "Ú": 153,
71
+ "ă": 194,
72
+ "ª": 106,
73
+ "Ę": 215,
74
+ "sand": 294,
75
+ "gh": 277,
76
+ "<pad>": 2,
77
+ "Ī": 233,
78
+ "sixty-nine": 351,
79
+ "r": 84,
80
+ "Ð": 143,
81
+ "seventy-four": 346,
82
+ "+": 13,
83
+ "Ã": 130,
84
+ "hu": 262,
85
+ "D": 38,
86
+ "[": 61,
87
+ "ten": 315,
88
+ "ĸ": 247,
89
+ "tw": 283,
90
+ "ninety": 316,
91
+ "fi": 276,
92
+ "ree": 288,
93
+ "ä": 163,
94
+ "±": 112,
95
+ "«": 107,
96
+ "twen": 301,
97
+ "ļ": 251,
98
+ "æ": 165,
99
+ "sixty-": 304,
100
+ "Ĩ": 231,
101
+ "Ć": 197,
102
+ "ē": 210,
103
+ "`": 66,
104
+ "nine": 280,
105
+ "\\": 62,
106
+ "Ô": 147,
107
+ "ě": 218,
108
+ "teen": 311,
109
+ "ą": 196,
110
+ "º": 121,
111
+ "¾": 125,
112
+ "ü": 187,
113
+ "z": 92,
114
+ "Ą": 195,
115
+ "_": 65,
116
+ "sixty-six": 353,
117
+ "IJ": 241,
118
+ "Ĝ": 219,
119
+ "sixty-four": 355,
120
+ "Ğ": 221,
121
+ "ĭ": 236,
122
+ "b": 68,
123
+ "ã": 162,
124
+ "1": 19,
125
+ "o": 81,
126
+ "seventy-": 303,
127
+ "¯": 110,
128
+ "Ñ": 144,
129
+ "°": 111,
130
+ "é": 168,
131
+ "eigh": 284,
132
+ "eight": 286,
133
+ "ĉ": 200,
134
+ "Â": 129,
135
+ "en": 292,
136
+ "=": 31,
137
+ "Ģ": 225,
138
+ "×": 150,
139
+ "ß": 158,
140
+ "M": 47,
141
+ "]": 63,
142
+ "thi": 298,
143
+ "ninety-four": 337,
144
+ ":": 28,
145
+ "¤": 100,
146
+ "Ì": 139,
147
+ "rty": 313,
148
+ "ninety-one": 332,
149
+ "een": 310,
150
+ "à": 159,
151
+ "Ń": 258,
152
+ "÷": 182,
153
+ "@": 34,
154
+ "p": 82,
155
+ "ninety-three": 338,
156
+ "¬": 108,
157
+ "6": 24,
158
+ "thousand": 296,
159
+ "seventy-one": 341,
160
+ "Ā": 191,
161
+ "/": 17,
162
+ "Ä": 131,
163
+ "Ă": 193,
164
+ "0": 18,
165
+ "S": 53,
166
+ "´": 115,
167
+ "V": 56,
168
+ "ty-": 267,
169
+ "three": 289,
170
+ "thirty-": 306,
171
+ "twenty": 331,
172
+ "7": 25,
173
+ "®": 109,
174
+ "ė": 214,
175
+ "sixty-five": 357,
176
+ "fif": 299,
177
+ "eighty-one": 359,
178
+ "ÿ": 190,
179
+ "j": 76,
180
+ "î": 173,
181
+ "Ë": 138,
182
+ "rteen": 314,
183
+ "seventy-three": 347,
184
+ "»": 122,
185
+ "H": 42,
186
+ "x": 90,
187
+ "ine": 279,
188
+ "Ė": 213,
189
+ "Ý": 156,
190
+ "ix": 278,
191
+ "ā": 192,
192
+ "ĺ": 249,
193
+ "thou": 295,
194
+ "§": 103,
195
+ "ĩ": 232,
196
+ ".": 16,
197
+ "T": 54,
198
+ "ð": 175,
199
+ "ģ": 226,
200
+ "I": 43,
201
+ "t": 86,
202
+ "Ē": 209,
203
+ "è": 167,
204
+ "Á": 128,
205
+ "seventy-eight": 345,
206
+ "¥": 101,
207
+ "thirty": 326,
208
+ "sixty-seven": 352,
209
+ "ó": 178,
210
+ "sixty-eight": 354,
211
+ "Ö": 149,
212
+ "4": 22,
213
+ "re": 260,
214
+ "Ü": 155,
215
+ "Ď": 205,
216
+ "sixty-two": 358,
217
+ "q": 83,
218
+ "g": 73,
219
+ "Ċ": 201,
220
+ "hundred": 265,
221
+ "Ł": 256,
222
+ "ī": 234,
223
+ "twenty-": 309,
224
+ "ľ": 253,
225
+ "W": 57,
226
+ "©": 105,
227
+ "Ï": 142,
228
+ "Z": 60,
229
+ ",": 14,
230
+ "N": 48,
231
+ "¿": 126,
232
+ "ď": 206,
233
+ "ı": 240,
234
+ "ei": 275,
235
+ "Ĥ": 227,
236
+ "Ě": 217,
237
+ "ï": 174,
238
+ "L": 46,
239
+ "Ĕ": 211,
240
+ "·": 118,
241
+ "rty-": 293,
242
+ "fifty": 328,
243
+ "m": 79,
244
+ "one": 272,
245
+ "sixty-three": 356,
246
+ "seventeen": 319,
247
+ "u": 87,
248
+ "w": 89,
249
+ "Æ": 133,
250
+ "F": 40,
251
+ "two": 291,
252
+ "Ļ": 250,
253
+ "</s>": 1,
254
+ "fourteen": 325,
255
+ "#": 5,
256
+ "four": 287,
257
+ "e": 71,
258
+ "Þ": 157,
259
+ "â": 161,
260
+ "%": 7,
261
+ "y": 91,
262
+ "eighty-": 305,
263
+ "seventy-five": 348,
264
+ "Ħ": 229,
265
+ "Å": 132,
266
+ "Ĺ": 248,
267
+ ">": 32,
268
+ "İ": 239,
269
+ "fifteen": 329,
270
+ "C": 37,
271
+ "even": 274,
272
+ "sixty": 320,
273
+ "Ķ": 245,
274
+ "thousand,": 297,
275
+ "A": 35,
276
+ "ñ": 176,
277
+ "Q": 51,
278
+ "Ù": 152,
279
+ "Í": 140,
280
+ "ķ": 246,
281
+ "ö": 181,
282
+ "c": 69,
283
+ "½": 124,
284
+ "forty-": 308,
285
+ "d": 70,
286
+ "þ": 189,
287
+ "h": 74,
288
+ "U": 55,
289
+ "$": 6,
290
+ "*": 12,
291
+ "ŀ": 255,
292
+ "th": 270,
293
+ "\"": 4,
294
+ "f": 72,
295
+ ")": 11,
296
+ "5": 23,
297
+ "nineteen": 317,
298
+ "five": 290,
299
+ "O": 49,
300
+ "õ": 180,
301
+ "~": 96,
302
+ "Ĭ": 235,
303
+ "ninety-nine": 333,
304
+ "ë": 170,
305
+ "J": 44,
306
+ "ĝ": 220,
307
+ "ú": 185,
308
+ "a": 67,
309
+ "ne": 268,
310
+ "8": 26,
311
+ "ç": 166,
312
+ "å": 164,
313
+ "ninety-six": 335,
314
+ "nd": 259,
315
+ "Ò": 145,
316
+ "|": 94,
317
+ "ninety-seven": 334,
318
+ "thirteen": 327,
319
+ "ĕ": 212,
320
+ "É": 136,
321
+ "²": 113,
322
+ "Y": 59,
323
+ "K": 45,
324
+ ";": 29,
325
+ "seventy-six": 344,
326
+ "¸": 119,
327
+ "k": 77,
328
+ "ninety-eight": 336,
329
+ "fo": 300,
330
+ "ø": 183,
331
+ "ġ": 224,
332
+ "^": 64,
333
+ "eve": 273,
334
+ "Ĉ": 199,
335
+ "seventy-seven": 343,
336
+ "Ŀ": 254,
337
+ "!": 3,
338
+ "eighteen": 324,
339
+ "ô": 179,
340
+ "sixty-one": 350,
341
+ "'": 9,
342
+ "six": 282,
343
+ "đ": 208,
344
+ "ğ": 222,
345
+ "À": 127,
346
+ "seventy-nine": 342,
347
+ "ħ": 230,
348
+ "Ó": 146,
349
+ "X": 58,
350
+ "ninety-five": 339,
351
+ "and": 261,
352
+ "ê": 169,
353
+ "<": 30,
354
+ "n": 80,
355
+ "seventy": 318,
356
+ "ł": 257,
357
+ "µ": 116,
358
+ "forty": 330,
359
+ "į": 238,
360
+ "ì": 171,
361
+ "ninety-": 302
362
+ }
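
Reviewer note: IDs 3 to 258 in the vocabulary appear to be the 256 byte-level symbols, which is why a space shows up as `Ġ` (ID 223) in the README's sample encoding. The sketch below rebuilds the GPT-2 style byte-to-character table and assigns IDs in code-point order after the three special tokens. That ordering is inferred from the listing in this commit, not taken from the training code, and `bytes_to_unicode` and `byte_symbol_ids` are illustrative names.

```python
def bytes_to_unicode():
    """GPT-2 style table: printable bytes map to themselves, the rest to U+0100 onward."""
    printable = set(range(ord("!"), ord("~") + 1))
    printable |= set(range(0xA1, 0xAC + 1))
    printable |= set(range(0xAE, 0xFF + 1))
    mapping, extra = {}, 0
    for byte in range(256):
        if byte in printable:
            mapping[byte] = chr(byte)
        else:
            mapping[byte] = chr(256 + extra)
            extra += 1
    return mapping


def byte_symbol_ids(first_id=3):
    """Assumed layout: symbols sorted by code point, numbered after 3 special tokens."""
    symbols = sorted(bytes_to_unicode().values())
    return {symbol: first_id + index for index, symbol in enumerate(symbols)}


mapping = bytes_to_unicode()
ids = byte_symbol_ids()
print(mapping[ord(" ")], ids[mapping[ord(" ")]])
print(mapping[ord("\n")], ids[mapping[ord("\n")]])
```

The remaining IDs, 259 to 359, line up with the 101 merge rules in `merges.txt`, one new token per rule.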