Mismatch between tokenizer and model vocab size?

#72

by yairschiff - opened Mar 5, 2025

When I run the following snippet:

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

print(f"{tokenizer.vocab_size=} vs. {model.config.vocab_size=}")

There appears to be a mismatch in size:

>>> tokenizer.vocab_size=50280 vs. model.config.vocab_size=50368

What is the correct usage of the tokenizer / model combination here?

yairschiff

Mar 18, 2025

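Sorry about this. I realized I should have been using len(tokenizer), which indeed returns 50368. tokenizer.vocab_size only counts the base vocabulary and leaves out added tokens, whereas len(tokenizer) includes them.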
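For anyone landing here with the same question, here is a minimal sketch of the check. It relies only on the two size attributes discussed in this thread (`tokenizer.vocab_size` and `len(tokenizer)`). The helper `vocab_report` and the `StubTokenizer` class are hypothetical names made up for illustration, so the snippet runs without downloading anything. The figure of 88 added tokens is simply the difference between the two numbers reported above.

```python
def vocab_report(tokenizer, model_vocab_size):
    """Compare a tokenizer's vocabulary sizes against a model's vocab size."""
    base = tokenizer.vocab_size  # base vocabulary only, without added tokens
    full = len(tokenizer)        # base vocabulary plus added tokens
    return {
        "base_vocab": base,
        "full_vocab": full,
        "added_tokens": full - base,
        # Every token id must index a valid embedding row in the model.
        "fits_model": full <= model_vocab_size,
    }


class StubTokenizer:
    """Hypothetical stand-in exposing the same two sizes; not a real tokenizer."""

    def __init__(self, vocab_size, num_added):
        self.vocab_size = vocab_size
        self._num_added = num_added

    def __len__(self):
        return self.vocab_size + self._num_added


# Numbers reported in this thread: 50280 base tokens, 50368 in total.
report = vocab_report(StubTokenizer(50280, 88), model_vocab_size=50368)
print(report)
```

With the real objects from the original snippet, the call would presumably be `vocab_report(tokenizer, model.config.vocab_size)`.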

yairschiff changed discussion status to closed Mar 18, 2025
