How to use pszemraj/bytebpe-tokenizer-32k-mlm with Transformers:

# Load the tokenizer directly
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("pszemraj/bytebpe-tokenizer-32k-mlm")

BPE tokenizer for encoders/the MLM objective, with byte-pair fallback. Trained on pints-ai/Expository-Prose-V1; this tokenizer is primarily for English and code.

model_max_length is set to 1e9 so the tokenizer itself never silently truncates. Set tokenizer.model_max_length to your model's max position embeddings when training, as in the sketch below.
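For example, a minimal sketch of wiring that up, assuming a BERT-style config that exposes max_position_embeddings (the "bert-base-uncased" checkpoint here is only a stand-in for your own encoder):

# Sketch: cap the tokenizer at the model's position-embedding budget.
from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pszemraj/bytebpe-tokenizer-32k-mlm")
config = AutoConfig.from_pretrained("bert-base-uncased")  # stand-in model

tokenizer.model_max_length = config.max_position_embeddings
print(tokenizer.model_max_length)  # 512 for bert-base-uncased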
MLM Tokenizer Configuration (EN/Code)

Model: byte-level BPE, 32,000-token vocabulary, byte-pair fallback
Pre-tokenization:
Normalization:
Post-processing:
Decoder:
Key Features: byte-pair fallback for full-coverage encoding of English and code; right-side padding and truncation; standard encoder special tokens ([CLS], [SEP], [UNK], [PAD], [MASK])
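A quick round-trip check (my own sketch, not from the card) of what the byte-pair fallback buys you: any input, including characters outside the English/code training focus, should encode and decode losslessly. This assumes the byte-level behavior implied by the "bytebpe" name; verify on your own inputs.

# Sketch: byte-level coverage should make encode/decode a lossless round trip.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pszemraj/bytebpe-tokenizer-32k-mlm")

for text in ["The quick brown fox.", "def f(x): return x ** 2", "naïve café 你好"]:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    print(len(ids), tokenizer.decode(ids) == text)  # expect True for each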
After loading with AutoTokenizer and calling print(tokenizer), you should see:
PreTrainedTokenizerFast(
name_or_path='pszemraj/bytebpe-tokenizer-32k-mlm',
vocab_size=32000,
model_max_length=1000000000.0,
is_fast=True,
padding_side='right',
truncation_side='right',
special_tokens={
'bos_token': '[CLS]',
'eos_token': '[SEP]',
'unk_token': '[UNK]',
'sep_token': '[SEP]',
'pad_token': '[PAD]',
'cls_token': '[CLS]',
'mask_token': '[MASK]',
},
clean_up_tokenization_spaces=False,
added_tokens_decoder={
0: AddedToken(
"[UNK]",
rstrip=False,
lstrip=False,
single_word=False,
normalized=False,
special=True,
),
1: AddedToken(
"[CLS]",
rstrip=False,
lstrip=False,
single_word=False,
normalized=False,
special=True,
),
2: AddedToken(
"[SEP]",
rstrip=False,
lstrip=False,
single_word=False,
normalized=False,
special=True,
),
3: AddedToken(
"[PAD]",
rstrip=False,
lstrip=False,
single_word=False,
normalized=False,
special=True,
),
4: AddedToken(
"[MASK]",
rstrip=False,
lstrip=False,
single_word=False,
normalized=False,
special=True,
),
},
)
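Since the [MASK], [CLS], [SEP], and [PAD] tokens above follow the standard encoder layout, the tokenizer should drop straight into Transformers' MLM data collator. A minimal sketch (the 15% mask rate is an illustrative default, not something this card specifies):

# Sketch: build MLM training batches with this tokenizer.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("pszemraj/bytebpe-tokenizer-32k-mlm")
tokenizer.model_max_length = 512  # match your model's max position embeddings

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

texts = ["Expository prose is plain prose.", "def add(a, b):\n    return a + b"]
encodings = tokenizer(texts, truncation=True)
features = [{"input_ids": ids} for ids in encodings["input_ids"]]

batch = collator(features)  # pads, then masks ~15% of non-special tokens
print(batch["input_ids"].shape, batch["labels"].shape)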