---
license: mit
library_name: transformers
tags:
- cl100k_base
- tiktoken
---
# cl100k_base as `transformers` GPT2 tokenizer
The `cl100k_base` vocab, converted from `tiktoken` to the Hugging Face `transformers` format via [this conversion script](https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee) by Xenova.
```py
from transformers import AutoTokenizer, GPT2TokenizerFast

tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/cl100k_base")

# if AutoTokenizer has issues, load with GPT2TokenizerFast directly:
# tokenizer = GPT2TokenizerFast.from_pretrained("BEE-spoke-data/cl100k_base")
```
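A quick round-trip sanity check, assuming the tokenizer loaded as above (the example string and printed output are illustrative, not recorded results):

```py
text = "hello world"

# encode to cl100k_base token ids, then decode back
ids = tokenizer(text)["input_ids"]
print(ids)
print(tokenizer.decode(ids))  # should round-trip back to the original text
```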
## details
```py
GPT2TokenizerFast(
    name_or_path="BEE-spoke-data/cl100k_base",
    vocab_size=100261,
    model_max_length=8192,
    is_fast=True,
    padding_side="right",
    truncation_side="right",
    special_tokens={
        "bos_token": "<|endoftext|>",
        "eos_token": "<|endoftext|>",
        "unk_token": "<|endoftext|>",
    },
    clean_up_tokenization_spaces=True,
    added_tokens_decoder={
        "100257": AddedToken(
            "<|endoftext|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100258": AddedToken(
            "<|fim_prefix|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100259": AddedToken(
            "<|fim_middle|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100260": AddedToken(
            "<|fim_suffix|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100276": AddedToken(
            "<|endofprompt|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
    },
)
```
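
The FIM tokens registered above (`<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`) can be used to assemble fill-in-the-middle style prompts. A minimal sketch, assuming the usual prefix/suffix/middle prompt layout (the helper name and the example snippet are made up for illustration, not part of this repo):

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/cl100k_base")

# hypothetical helper: place the prefix and suffix, then let the model generate the middle
def build_fim_prompt(prefix: str, suffix: str) -> str:
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt("def add(a, b):\n    ", "\n    return result")
ids = tokenizer(prompt)["input_ids"]

# each FIM marker should map to a single added-token id (100258, 100260, 100259)
print(tokenizer.convert_ids_to_tokens(ids))
```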