pad and unk indices are outside the max tokenizer ID
#3 opened by AngledLuffa
After loading the tokenizer, I have

```
vocab_size=119547,
added_tokens_decoder={
    2: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    3: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    119547: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    119548: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
```
This is problematic: if you try to embed multiple sentences at once using padding and an attention mask, it throws an exception, because the padding token's ID (119547) is at or beyond `vocab_size` and therefore can't be looked up in the embedding matrix.
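A minimal sketch of why the lookup fails, using a plain Python list in place of the model's embedding matrix (the sizes mirror the dump above; the embedding dimension of 4 is arbitrary). In transformers, the usual remedy is `model.resize_token_embeddings(len(tokenizer))`, which grows the matrix to cover the added tokens, as simulated at the end:

```python
vocab_size = 119547               # reported tokenizer vocab_size
pad_id, unk_id = 119547, 119548   # added-token IDs from added_tokens_decoder

# Stand-in for the model's embedding matrix: one row per known token ID.
embedding_table = [[0.0] * 4 for _ in range(vocab_size)]

def embed(token_ids):
    # Raises IndexError for any ID >= vocab_size, e.g. the <pad> token.
    return [embedding_table[i] for i in token_ids]

try:
    embed([2, 17, pad_id])        # [CLS], some word piece, then padding
except IndexError:
    print("pad ID is out of range")   # what the reported exception boils down to

# Simulate resize_token_embeddings: add rows for <pad> and <unk>.
embedding_table.extend([[0.0] * 4 for _ in range(2)])
embed([2, 17, pad_id])            # now succeeds
```

This is only an illustration of the indexing failure, not the model's actual embedding code.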