Mismatch between tokenizer and model vocab size?
#72
by yairschiff - opened
When I run the following snippet:
from transformers import AutoModelForMaskedLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
print(f"{tokenizer.vocab_size=} vs. {model.config.vocab_size=}")
There appears to be a mismatch in size:
>>> tokenizer.vocab_size=50280 vs. model.config.vocab_size=50368
What is the correct usage of the tokenizer / model combination here?
Sorry about this. I realized I should have been using len(tokenizer), which does indeed return 50368.
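For anyone landing here with the same question, the check can be sketched roughly as below. In Transformers, `tokenizer.vocab_size` reports the base vocabulary without added tokens, while `len(tokenizer)` includes them, so the latter is the number to compare against `model.config.vocab_size`. The helper name `check_vocab_alignment` and the keys of the returned dict are hypothetical, made up for illustration; only `vocab_size`, `len(tokenizer)`, and `model.config.vocab_size` come from the thread.

```python
def check_vocab_alignment(tokenizer, model):
    """Compare tokenizer sizes against the model's configured vocab size.

    Hypothetical helper, not part of Transformers. It only relies on
    tokenizer.vocab_size, len(tokenizer), and model.config.vocab_size.
    """
    base_vocab = tokenizer.vocab_size      # base vocabulary, added tokens excluded
    full_vocab = len(tokenizer)            # base vocabulary plus added tokens
    model_vocab = model.config.vocab_size  # size the model config declares

    return {
        "base_vocab": base_vocab,
        "full_vocab": full_vocab,
        "added_tokens": full_vocab - base_vocab,
        "model_vocab": model_vocab,
        # Every token id the tokenizer can emit must be below model_vocab.
        "fits": full_vocab <= model_vocab,
    }


# Usage (assumes transformers is installed and the checkpoint can be downloaded):
#
#   from transformers import AutoModelForMaskedLM, AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
#   model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
#   print(check_vocab_alignment(tokenizer, model))
```

With the numbers reported in this thread (50280, 50368, 50368), the gap between the two tokenizer sizes would be 88 added tokens, and the full tokenizer size matches the model's configured vocab size.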
yairschiff changed discussion status to closed