---
license: mit
library_name: transformers
tags:
  - cl100k_base
  - tiktoken
---

# cl100k_base as a `transformers` GPT2 tokenizer

The `cl100k_base` vocabulary, converted from `tiktoken` to the Hugging Face `transformers` format via this code by Xenova.

```python
from transformers import GPT2TokenizerFast, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/cl100k_base")
# if you run into issues, try loading with GPT2TokenizerFast directly
```

## Details

```python
GPT2TokenizerFast(
    name_or_path="BEE-spoke-data/cl100k_base",
    vocab_size=100261,
    model_max_length=8192,
    is_fast=True,
    padding_side="right",
    truncation_side="right",
    special_tokens={
        "bos_token": "<|endoftext|>",
        "eos_token": "<|endoftext|>",
        "unk_token": "<|endoftext|>",
    },
    clean_up_tokenization_spaces=True,
    added_tokens_decoder={
        "100257": AddedToken(
            "<|endoftext|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100258": AddedToken(
            "<|fim_prefix|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100259": AddedToken(
            "<|fim_middle|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100260": AddedToken(
            "<|fim_suffix|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100276": AddedToken(
            "<|endofprompt|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
    },
)
```
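The `<|fim_prefix|>`, `<|fim_middle|>`, and `<|fim_suffix|>` tokens above are fill-in-the-middle (FIM) markers. As a minimal sketch, a FIM-style prompt can be assembled from these literal token strings; the prefix-suffix-middle (PSM) ordering used here is the common convention for FIM prompting, not something this model card itself prescribes:

```python
# Sketch: assemble a fill-in-the-middle prompt from the special tokens
# listed in the tokenizer config above. PSM (prefix, suffix, then middle)
# ordering is an assumption based on common FIM usage, not this card.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Return a PSM-ordered prompt; a FIM-trained model completes the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))")
```

Because these strings are registered as single added tokens, each marker encodes to exactly one id (100258, 100260, 100259 respectively) when the prompt is passed through the tokenizer.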