pad and unk indices are outside the max tokenizer ID
#3 opened by AngledLuffa
After loading the tokenizer, I have

```
vocab_size=119547,
added_tokens_decoder={
    2: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    3: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    119547: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    119548: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
```
This is problematic: if you try to embed multiple sentences at once using padding and an attention mask, it throws an exception, because the padding token's ID (119547) is at or beyond `vocab_size` and therefore can't be looked up in the embedding matrix.
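A minimal sketch of why the lookup fails, using a plain Python list in place of the model's embedding matrix (the sizes mirror the dump above; the embedding dimension of 4 is arbitrary). In transformers, the usual remedy is `model.resize_token_embeddings(len(tokenizer))`, which grows the matrix to cover the added tokens, as simulated at the end:

```python
vocab_size = 119547               # reported tokenizer vocab_size
pad_id, unk_id = 119547, 119548   # added-token IDs from added_tokens_decoder

# Stand-in for the model's embedding matrix: one row per known token ID.
embedding_table = [[0.0] * 4 for _ in range(vocab_size)]

def embed(token_ids):
    # Raises IndexError for any ID >= vocab_size, e.g. the <pad> token.
    return [embedding_table[i] for i in token_ids]

try:
    embed([2, 17, pad_id])        # [CLS], some word piece, then padding
except IndexError:
    print("pad ID is out of range")   # what the reported exception boils down to

# Simulate resize_token_embeddings: add rows for <pad> and <unk>.
embedding_table.extend([[0.0] * 4 for _ in range(2)])
embed([2, 17, pad_id])            # now succeeds
```

This is only an illustration of the indexing failure, not the model's actual embedding code.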