
pad and unk indices are outside the max tokenizer ID

#3 opened by AngledLuffa

After loading the tokenizer, I have

vocab_size=119547,
added_tokens_decoder={
        2: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        3: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        119547: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        119548: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

This is problematic: the embedding table has vocab_size (119547) rows, but the added <pad> token was assigned ID 119547 and <unk> ID 119548, both out of range. If you try to embed a batch of sentences at once using padding and an attention mask, the forward pass throws an exception because the padding token ID cannot be looked up in the embedding layer.
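A minimal sketch of the failure mode, using a plain torch.nn.Embedding in place of the model (the sizes mirror the tokenizer dump above; the layer width of 8 is arbitrary):

```python
import torch
import torch.nn as nn

# The model's embedding table has vocab_size rows, indexed 0..vocab_size-1.
vocab_size = 119547
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=8)

# An in-range token ID embeds fine.
ok = embedding(torch.tensor([vocab_size - 1]))

# The added <pad> token got ID 119547 == vocab_size, which is out of
# range for the table, so the lookup raises IndexError.
pad_id = 119547
try:
    embedding(torch.tensor([pad_id]))
    raised = False
except IndexError:
    raised = True
print(raised)  # True
```

If that is indeed the cause, the usual fix on the model side is to grow the embedding table to cover the added tokens, e.g. with transformers' `model.resize_token_embeddings(len(tokenizer))`; fixing the checkpoint itself would mean assigning the special tokens IDs inside the existing vocabulary.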
