tdooms committed on
Commit ed97dfd · verified · 1 Parent(s): 23467df

Upload tokenizer

Files changed (3)
  1. README.md +3 -0
  2. tokenizer.json +0 -0
  3. tokenizer_config.json +2 -10
README.md CHANGED
@@ -1,3 +1,6 @@
+---
+{}
+---
 This is a very small uncased tokenizer for the [non-ascii version of TinyStories](https://huggingface.co/datasets/tdooms/TinyStories), based on the [original TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories). I use a WordPiece tokenizer with a vocabulary of 4096.
 
 The tokenizer is strictly fitted to the mentioned dataset and probably won't work well in any context outside of children's stories.
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -9,7 +9,7 @@
     "special": true
   },
   "1": {
-    "content": "[CLS]",
+    "content": "[BOS]",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
@@ -17,15 +17,7 @@
     "special": true
   },
   "2": {
-    "content": "[SEP]",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false,
-    "special": true
-  },
-  "3": {
-    "content": "[PAD]",
+    "content": "[EOS]",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
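The config hunks above swap the BERT-style `[CLS]`/`[SEP]`/`[PAD]` specials for `[BOS]` at id 1 and `[EOS]` at id 2, matching the README's description of a small uncased WordPiece tokenizer with a 4096-token vocabulary. A minimal sketch of building such a tokenizer with the Hugging Face `tokenizers` library — the training corpus and exact normalizer settings here are illustrative assumptions, not taken from this commit:

```python
# Sketch only: reproduces the special-token layout from tokenizer_config.json
# ([UNK]=0, [BOS]=1, [EOS]=2); the corpus and normalizer are assumptions.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Lowercase()        # "uncased"
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=4096,                                  # vocabulary size from the README
    special_tokens=["[UNK]", "[BOS]", "[EOS]"],       # order fixes ids 0, 1, 2
)

# Stand-in corpus; the real tokenizer is trained on the TinyStories dataset.
corpus = [
    "Once upon a time there was a little girl.",
    "The little girl liked to hear stories.",
]
tokenizer.train_from_iterator(corpus, trainer)

ids = tokenizer.encode("the little girl").ids
```

With the special tokens listed first, the trainer assigns them the lowest ids, so the resulting `tokenizer_config.json` maps `"1"` to `[BOS]` and `"2"` to `[EOS]` as in the diff.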