Add meta data
README.md
CHANGED
@@ -1,3 +1,12 @@
+---
+license: gpl-2.0
+language:
+- en
+- ja
+tags:
+- tokenizer
+- novelai
+---
 # Tokenizer
 
 Finetune here to talk a bit about [NovelAI](https://novelai.net/)'s new tokenizer that I worked on. First a quick reminder. In most cases, our models don't see words as individual letters. Instead, text is broken down into tokens, which are words or word fragments. For example, the sentence “`The quick brown fox jumps over the goblin.`” would tokenize as “`The| quick| brown| fox| jumps| over| the| go|bl|in.`” in the Pile tokenizer used by GPT-NeoX 20B and Krake, with each | signifying a boundary between tokens.
@@ -64,4 +73,4 @@ print("Readable tokens:", s.encode(text, out_type=str))
 
 ## License
 
-The tokenizer is licensed under the GNU General Public License, version 2.
+The tokenizer is licensed under the GNU General Public License, version 2.
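The subword splitting the README paragraph describes can be illustrated with a toy greedy longest-match tokenizer over a small hand-picked vocabulary. This is purely illustrative: the actual Pile tokenizer is BPE-based and applies learned merge rules, so the vocabulary and the longest-match strategy below are assumptions made only to reproduce the example split.

```python
# Toy illustration of subword tokenization: greedy longest-match against a
# tiny hypothetical vocabulary. Real tokenizers (e.g. the BPE-based Pile
# tokenizer) use learned merge rules instead of a hand-written vocabulary.
VOCAB = {"The", " quick", " brown", " fox", " jumps", " over", " the",
         " go", "bl", "in."}

def tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest vocabulary entry matching at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Character not covered by the vocabulary: emit it on its own.
            tokens.append(text[i])
            i += 1
    return tokens

print("|".join(tokenize("The quick brown fox jumps over the goblin.")))
# → The| quick| brown| fox| jumps| over| the| go|bl|in.
```

Note how "goblin." falls outside the vocabulary and is assembled from the fragments " go", "bl", and "in.", mirroring the split shown in the README.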