---
license: mit
library_name: transformers
tags:
- cl100k_base
- tiktoken
---

# cl100k_base as `transformers` GPT2 tokenizer

The `cl100k_base` vocabulary (the tokenizer used by OpenAI's GPT-3.5 and GPT-4 models) converted from `tiktoken` to the Hugging Face `transformers` format via [this code](https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee) by Xenova.

```py
from transformers import AutoTokenizer, GPT2TokenizerFast

tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/cl100k_base")

# if AutoTokenizer misbehaves, load with GPT2TokenizerFast directly:
# tokenizer = GPT2TokenizerFast.from_pretrained("BEE-spoke-data/cl100k_base")
```
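
To sanity-check the conversion, you can compare token IDs against `tiktoken` itself (a quick sketch, assuming `tiktoken` is installed; the IDs should match if the conversion is faithful):

```py
import tiktoken
from transformers import AutoTokenizer

hf_tok = AutoTokenizer.from_pretrained("BEE-spoke-data/cl100k_base")
tt_enc = tiktoken.get_encoding("cl100k_base")

text = "The quick brown fox jumps over the lazy dog."
hf_ids = hf_tok(text, add_special_tokens=False).input_ids
tt_ids = tt_enc.encode(text)

# both encoders should produce identical IDs for plain text
assert hf_ids == tt_ids, (hf_ids, tt_ids)
print(hf_tok.decode(hf_ids))
```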


## details

```py
GPT2TokenizerFast(
    name_or_path="BEE-spoke-data/cl100k_base",
    vocab_size=100261,
    model_max_length=8192,
    is_fast=True,
    padding_side="right",
    truncation_side="right",
    special_tokens={
        "bos_token": "<|endoftext|>",
        "eos_token": "<|endoftext|>",
        "unk_token": "<|endoftext|>",
    },
    clean_up_tokenization_spaces=True,
    added_tokens_decoder={
        "100257": AddedToken(
            "<|endoftext|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100258": AddedToken(
            "<|fim_prefix|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100259": AddedToken(
            "<|fim_middle|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100260": AddedToken(
            "<|fim_suffix|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100276": AddedToken(
            "<|endofprompt|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
    },
)
```
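
Note that the config above defines no `pad_token`. For padded batching, a common workaround (not part of the tokenizer config itself, just a usage sketch) is to reuse the EOS token:

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/cl100k_base")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse <|endoftext|> as padding

batch = tokenizer(["short", "a somewhat longer input"], padding=True, return_tensors="pt")
print(batch.input_ids.shape)  # (2, longest sequence in the batch)
```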