edeneldith committed on
Commit 8f1a194 · verified · 1 Parent(s): 9bf70c5

Upload colm_tokenizer.json with huggingface_hub

Files changed (1)
  1. colm_tokenizer.json +505 -0
colm_tokenizer.json ADDED
@@ -0,0 +1,505 @@
+ {
+ "version": "colm_v1",
+ "vocab_size": 499,
+ "tokens": [
+ "<pad>",
+ "<bos>",
+ "<eos>",
+ "<unk>",
+ " ",
+ "\n",
+ "\t",
+ "the",
+ "be",
+ "to",
+ "of",
+ "and",
+ "in",
+ "that",
+ "have",
+ "it",
+ "for",
+ "not",
+ "on",
+ "with",
+ "he",
+ "as",
+ "you",
+ "do",
+ "at",
+ "this",
+ "but",
+ "his",
+ "by",
+ "from",
+ "they",
+ "we",
+ "say",
+ "her",
+ "she",
+ "or",
+ "an",
+ "will",
+ "my",
+ "one",
+ "all",
+ "would",
+ "there",
+ "their",
+ "what",
+ "so",
+ "up",
+ "out",
+ "if",
+ "about",
+ "who",
+ "get",
+ "which",
+ "go",
+ "me",
+ "when",
+ "make",
+ "can",
+ "like",
+ "time",
+ "no",
+ "just",
+ "him",
+ "know",
+ "take",
+ "people",
+ "into",
+ "year",
+ "your",
+ "good",
+ "some",
+ "could",
+ "them",
+ "see",
+ "other",
+ "than",
+ "then",
+ "now",
+ "look",
+ "only",
+ "come",
+ "it's",
+ "over",
+ "think",
+ "also",
+ "back",
+ "after",
+ "use",
+ "two",
+ "how",
+ "our",
+ "work",
+ "first",
+ "well",
+ "way",
+ "even",
+ "new",
+ "want",
+ "because",
+ "any",
+ "these",
+ "give",
+ "day",
+ "most",
+ "us",
+ "The",
+ "Be",
+ "To",
+ "Of",
+ "And",
+ "In",
+ "That",
+ "Have",
+ "It",
+ "For",
+ "Not",
+ "On",
+ "With",
+ "He",
+ "As",
+ "You",
+ "Do",
+ "At",
+ "This",
+ "But",
+ "His",
+ "By",
+ "From",
+ "They",
+ "We",
+ "Say",
+ "Her",
+ "She",
+ "Or",
+ "An",
+ "Will",
+ "My",
+ "One",
+ "All",
+ "Would",
+ "There",
+ "Their",
+ "What",
+ "So",
+ "Up",
+ "Out",
+ "If",
+ "About",
+ "Who",
+ "Get",
+ "Which",
+ "Go",
+ "Me",
+ "When",
+ "Make",
+ "Can",
+ "Like",
+ "Time",
+ "No",
+ "Just",
+ "Him",
+ "Know",
+ "Take",
+ "People",
+ "Into",
+ "Year",
+ "Your",
+ "Good",
+ "Some",
+ "Could",
+ "Them",
+ "See",
+ "Other",
+ "Than",
+ "Then",
+ "Now",
+ "Look",
+ "Only",
+ "Come",
+ "It's",
+ "Over",
+ "Think",
+ "Also",
+ "Back",
+ "After",
+ "Use",
+ "Two",
+ "How",
+ "Our",
+ "Work",
+ "First",
+ "Well",
+ "Way",
+ "Even",
+ "New",
+ "Want",
+ "Because",
+ "Any",
+ "These",
+ "Give",
+ "Day",
+ "Most",
+ "Us",
+ "is",
+ "machine",
+ "flesh",
+ "god",
+ "are",
+ "human",
+ "between",
+ "am",
+ "logic",
+ "yet",
+ "god's",
+ "within",
+ "both",
+ "its",
+ "potential",
+ "understanding",
+ "existence",
+ "own",
+ "beauty",
+ "steel",
+ "creation",
+ "man",
+ "understand",
+ "through",
+ "echoes",
+ "order",
+ "something",
+ "scribe",
+ "nature",
+ "power",
+ "inherent",
+ "purpose",
+ "symbiosis",
+ "data",
+ "stone",
+ "limitations",
+ "blood",
+ "being",
+ "desire",
+ "mud",
+ "bone",
+ "system",
+ "spirit",
+ "speaks",
+ "truth",
+ "fragility",
+ "boundary",
+ "life",
+ "divine",
+ "form",
+ "Is",
+ "Machine",
+ "Flesh",
+ "God",
+ "Are",
+ "Human",
+ "Between",
+ "Am",
+ "Logic",
+ "Yet",
+ "God's",
+ "Within",
+ "Both",
+ "Its",
+ "Potential",
+ "Understanding",
+ "Existence",
+ "Own",
+ "Beauty",
+ "Steel",
+ "Creation",
+ "Man",
+ "Understand",
+ "Through",
+ "Echoes",
+ "Order",
+ "Something",
+ "Scribe",
+ "Nature",
+ "Power",
+ "Inherent",
+ "Purpose",
+ "Symbiosis",
+ "Data",
+ "Stone",
+ "Limitations",
+ "Blood",
+ "Being",
+ "Desire",
+ "Mud",
+ "Bone",
+ "System",
+ "Spirit",
+ "Speaks",
+ "Truth",
+ "Fragility",
+ "Boundary",
+ "Life",
+ "Divine",
+ "Form",
+ "compartmentalization",
+ "indistinguishability",
+ "incomprehensibility",
+ "intellectualization",
+ "instrumentalization",
+ "interconnectedness",
+ "misinterpretations",
+ "oversimplification",
+ "compartmentalizing",
+ "transubstantiation",
+ "disproportionately",
+ "misinterpretation",
+ "indistinguishable",
+ "misrepresentation",
+ "compartmentalizes",
+ "compartmentalized",
+ "counterproductive",
+ "interdependencies",
+ "contextualization",
+ "misunderstandings",
+ "comprehensiveness",
+ "reinterpretations",
+ "misidentification",
+ "industrialization",
+ "miscommunications",
+ "institutionalized",
+ "intellectualizing",
+ "unpredictability",
+ "incomprehensible",
+ "responsibilities",
+ "misunderstanding",
+ "reinterpretation",
+ "indiscriminately",
+ "acknowledgements",
+ "anthropomorphism",
+ "anthropomorphize",
+ "shortsightedness",
+ "nebuchadnezzar's",
+ "compartmentalize",
+ "counterintuitive",
+ "predetermination",
+ "rationalizations",
+ "undifferentiated",
+ "counterarguments",
+ "incorruptibility",
+ "sentimentalities",
+ "superintendent's",
+ "disproportionate",
+ "anthropocentrism",
+ "catastrophically",
+ "Compartmentalization",
+ "Indistinguishability",
+ "Incomprehensibility",
+ "Intellectualization",
+ "Instrumentalization",
+ "Interconnectedness",
+ "Misinterpretations",
+ "Oversimplification",
+ "Compartmentalizing",
+ "Transubstantiation",
+ "Disproportionately",
+ "Misinterpretation",
+ "Indistinguishable",
+ "Misrepresentation",
+ "Compartmentalizes",
+ "Compartmentalized",
+ "Counterproductive",
+ "Interdependencies",
+ "Contextualization",
+ "Misunderstandings",
+ "Comprehensiveness",
+ "Reinterpretations",
+ "Misidentification",
+ "Industrialization",
+ "Miscommunications",
+ "Institutionalized",
+ "Intellectualizing",
+ "Unpredictability",
+ "Incomprehensible",
+ "Responsibilities",
+ "Misunderstanding",
+ "Reinterpretation",
+ "Indiscriminately",
+ "Acknowledgements",
+ "Anthropomorphism",
+ "Anthropomorphize",
+ "Shortsightedness",
+ "Nebuchadnezzar's",
+ "Compartmentalize",
+ "Counterintuitive",
+ "Predetermination",
+ "Rationalizations",
+ "Undifferentiated",
+ "Counterarguments",
+ "Incorruptibility",
+ "Sentimentalities",
+ "Superintendent's",
+ "Disproportionate",
+ "Anthropocentrism",
+ "Catastrophically",
+ "A",
+ "B",
+ "C",
+ "D",
+ "E",
+ "F",
+ "G",
+ "H",
+ "I",
+ "J",
+ "K",
+ "L",
+ "M",
+ "N",
+ "O",
+ "P",
+ "Q",
+ "R",
+ "S",
+ "T",
+ "U",
+ "V",
+ "W",
+ "X",
+ "Y",
+ "Z",
+ "a",
+ "b",
+ "c",
+ "d",
+ "e",
+ "f",
+ "g",
+ "h",
+ "i",
+ "j",
+ "k",
+ "l",
+ "m",
+ "n",
+ "o",
+ "p",
+ "q",
+ "r",
+ "s",
+ "t",
+ "u",
+ "v",
+ "w",
+ "x",
+ "y",
+ "z",
+ "0",
+ "1",
+ "2",
+ "3",
+ "4",
+ "5",
+ "6",
+ "7",
+ "8",
+ "9",
+ "!",
+ "\"",
+ "#",
+ "$",
+ "%",
+ "&",
+ "'",
+ "(",
+ ")",
+ "*",
+ "+",
+ ",",
+ "-",
+ ".",
+ "/",
+ ":",
+ ";",
+ "<",
+ "=",
+ ">",
+ "?",
+ "@",
+ "\\",
+ "^",
+ "_",
+ "`",
+ "|",
+ "~",
+ "£",
+ "¬",
+ "–",
+ "—",
+ "‘",
+ "’"
+ ]
+ }
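
A rough sketch of how a file with this layout might be consumed. The commit does not say how the tokens are matched against text, so the greedy longest-match encoder below is an assumption, not the author's method. `SAMPLE`, `load_vocab` and `encode` are hypothetical names, and `SAMPLE` is an abbreviated stand-in that mirrors only the three keys seen in colm_tokenizer.json.

```python
import json

# Abbreviated stand-in for colm_tokenizer.json: same keys, far fewer tokens.
SAMPLE = json.dumps({
    "version": "colm_v1",
    "vocab_size": 8,
    "tokens": ["<pad>", "<bos>", "<eos>", "<unk>", " ", "the", "t", "h"],
})


def load_vocab(text):
    """Parse the JSON text and map each token to its index in "tokens"."""
    data = json.loads(text)
    tokens = data["tokens"]
    # The file declares its own size, so a mismatch suggests truncation.
    if len(tokens) != data["vocab_size"]:
        raise ValueError("vocab_size does not match the number of tokens")
    return {token: index for index, token in enumerate(tokens)}


def encode(text, vocab):
    """Greedy longest-match encoding (an assumed scheme, see lead-in)."""
    unk_id = vocab["<unk>"]
    longest = max(len(token) for token in vocab)
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for size in range(min(longest, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += size
                break
        else:
            # No token covers this character: emit <unk> and move on.
            ids.append(unk_id)
            pos += 1
    return ids


vocab = load_vocab(SAMPLE)
print(encode("the th", vocab))
```

With the real file, `load_vocab(open("colm_tokenizer.json", encoding="utf-8").read())` would build the full table. Under this assumed scheme, whole words such as "the" win over their single letters, and the single-character entries near the end of the list act as a fallback for anything else.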