---
library_name: transformers
---

This is a simple `PreTrainedTokenizerFast` with 5120 tokens trained on a subset of [karpathy/fineweb-edu-100b-shuffle](https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle), which is itself a subset of [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).

The tokenizer includes 6 special tokens:

```py
class SpecialTokens:
    PAD = 0
    BOS = 1
    EOS = 2
    SYSTEM = 3
    USER = 4
    ASSISTANT = 5


special_tokens_map = {
    "<|PAD|>": SpecialTokens.PAD,
    "<|BOS|>": SpecialTokens.BOS,
    "<|EOS|>": SpecialTokens.EOS,
    "<|SYSTEM|>": SpecialTokens.SYSTEM,
    "<|USER|>": SpecialTokens.USER,
    "<|ASSISTANT|>": SpecialTokens.ASSISTANT,
}
```
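
As a rough illustration of how these IDs might be used, the sketch below builds the inverse lookup (ID to token string) and assembles a chat-style prompt from the role markers. The `format_chat` helper, the `ROLE_TOKENS` table, and the prompt layout itself (BOS first, one role marker before each message, EOS last) are assumptions made for this example; the tokenizer does not prescribe a chat template here.

```py
# Illustrative sketch: the prompt layout is an assumption, not a documented chat template.
special_tokens_map = {
    "<|PAD|>": 0,
    "<|BOS|>": 1,
    "<|EOS|>": 2,
    "<|SYSTEM|>": 3,
    "<|USER|>": 4,
    "<|ASSISTANT|>": 5,
}

# Inverse lookup: token ID -> token string.
id_to_special_token = {
    token_id: token for token, token_id in special_tokens_map.items()
}

# Hypothetical mapping from role names to the role marker tokens.
ROLE_TOKENS = {
    "system": "<|SYSTEM|>",
    "user": "<|USER|>",
    "assistant": "<|ASSISTANT|>",
}


def format_chat(messages):
    """Hypothetical helper: join (role, text) pairs into a single prompt string."""
    parts = ["<|BOS|>"]
    for role, text in messages:
        parts.append(ROLE_TOKENS[role] + text)
    parts.append("<|EOS|>")
    return "".join(parts)


prompt = format_chat([("user", "Hello"), ("assistant", "Hi there")])
print(prompt)  # <|BOS|><|USER|>Hello<|ASSISTANT|>Hi there<|EOS|>
```

If the markers are registered as special tokens in the loaded tokenizer, encoding such a string should map each marker to its single ID from the table above; it is worth checking this on a short input before relying on it.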