tdooms committed on
Commit ed97dfd · verified · 1 Parent(s): 23467df

Upload tokenizer

Files changed (3)
  1. README.md +3 -0
  2. tokenizer.json +0 -0
  3. tokenizer_config.json +2 -10
README.md CHANGED
@@ -1,3 +1,6 @@
+---
+{}
+---
 This is a very small uncased tokenizer for the [non-ascii version of TinyStories](https://huggingface.co/datasets/tdooms/TinyStories), based on the [original TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories). I use a WordPiece tokenizer with a vocabulary of 4096.
 
 The tokenizer is strictly fitted to the mentioned dataset and probably won't work well in any context outside of children's stories.
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -9,7 +9,7 @@
     "special": true
   },
   "1": {
-    "content": "[CLS]",
+    "content": "[BOS]",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
@@ -17,15 +17,7 @@
     "special": true
   },
   "2": {
-    "content": "[SEP]",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false,
-    "special": true
-  },
-  "3": {
-    "content": "[PAD]",
+    "content": "[EOS]",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
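The config hunks above swap the BERT-style `[CLS]`/`[SEP]`/`[PAD]` specials for `[BOS]` at id 1 and `[EOS]` at id 2, matching the README's description of a small uncased WordPiece tokenizer with a 4096-token vocabulary. A minimal sketch of building such a tokenizer with the Hugging Face `tokenizers` library — the training corpus and exact normalizer settings here are illustrative assumptions, not taken from this commit:

```python
# Sketch only: reproduces the special-token layout from tokenizer_config.json
# ([UNK]=0, [BOS]=1, [EOS]=2); the corpus and normalizer are assumptions.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Lowercase()        # "uncased"
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=4096,                                  # vocabulary size from the README
    special_tokens=["[UNK]", "[BOS]", "[EOS]"],       # order fixes ids 0, 1, 2
)

# Stand-in corpus; the real tokenizer is trained on the TinyStories dataset.
corpus = [
    "Once upon a time there was a little girl.",
    "The little girl liked to hear stories.",
]
tokenizer.train_from_iterator(corpus, trainer)

ids = tokenizer.encode("the little girl").ids
```

With the special tokens listed first, the trainer assigns them the lowest ids, so the resulting `tokenizer_config.json` maps `"1"` to `[BOS]` and `"2"` to `[EOS]` as in the diff.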