---
library_name: transformers
---

This is a simple `PreTrainedTokenizerFast` with a vocabulary of 5120 tokens, trained on a subset of [karpathy/fineweb-edu-100b-shuffle](https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle), which is itself a subset of [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).

The tokenizer includes six special tokens, assigned IDs 0 through 5:

```py
class SpecialTokens:
    PAD       = 0
    BOS       = 1
    EOS       = 2
    SYSTEM    = 3
    USER      = 4
    ASSISTANT = 5

special_tokens_map = {
    "<|PAD|>":       SpecialTokens.PAD,
    "<|BOS|>":       SpecialTokens.BOS,
    "<|EOS|>":       SpecialTokens.EOS,
    "<|SYSTEM|>":    SpecialTokens.SYSTEM,
    "<|USER|>":      SpecialTokens.USER,
    "<|ASSISTANT|>": SpecialTokens.ASSISTANT
}
```
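
As a rough illustration of how these IDs might be used, the sketch below lays out a conversation as a flat ID sequence and right-pads a batch with `PAD`. The turn layout (`BOS`, then a role token followed by the text tokens for each turn, then `EOS`), the role names, and the helpers `build_chat_ids` and `pad_batch` are all assumptions made for this example; the model card does not document a chat format. The `encode` argument is a stand-in for whatever maps text to ordinary token IDs.

```python
from typing import Callable, Dict, List, Tuple

# IDs copied from the special-token table above.
PAD, BOS, EOS, SYSTEM, USER, ASSISTANT = range(6)

# Hypothetical role names; the model card does not define them.
ROLE_TO_ID: Dict[str, int] = {
    "system": SYSTEM,
    "user": USER,
    "assistant": ASSISTANT,
}


def build_chat_ids(
    messages: List[Tuple[str, str]],
    encode: Callable[[str], List[int]],
) -> List[int]:
    """Lay out a conversation as BOS, (role token, text tokens) per turn, EOS.

    This layout is an assumption for illustration, not a documented format.
    """
    ids = [BOS]
    for role, text in messages:
        ids.append(ROLE_TO_ID[role])  # mark who is speaking
        ids.extend(encode(text))      # ordinary (non-special) token IDs
    ids.append(EOS)
    return ids


def pad_batch(batch: List[List[int]]) -> List[List[int]]:
    """Right-pad every sequence with PAD so all rows match the longest one."""
    width = max(len(row) for row in batch)
    return [row + [PAD] * (width - len(row)) for row in batch]
```

With a loaded tokenizer, `encode` could plausibly be `lambda s: tokenizer.encode(s, add_special_tokens=False)`, so that only the helper inserts special tokens; whether that matches the intended training format is not stated here.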