Compact SentencePiece tokenizer for sample-efficient English language modeling, tokenizing plain natural language.
Load it as a `transformers` tokenizer:

```python
from transformers import AutoTokenizer

# Load the pretrained baby tokenizer from the Hugging Face Hub
tokenizer_baby = AutoTokenizer.from_pretrained("nilq/baby-tokenizer")
```

Or load it directly with the `tokenizers` library:

```python
from tokenizers import Tokenizer

# Same tokenizer, loaded via the lower-level tokenizers API
tokenizer_baby = Tokenizer.from_pretrained("nilq/baby-tokenizer")
```
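To illustrate the kind of subword tokenization involved without downloading the actual model, here is a minimal, self-contained sketch that trains a tiny BPE tokenizer on a toy corpus with the `tokenizers` library. The toy corpus, vocabulary size, and special tokens are illustrative choices, not properties of `nilq/baby-tokenizer` (which is a SentencePiece-style tokenizer trained on far more data):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical miniature analogue: a tiny BPE tokenizer trained on toy text.
# The real baby tokenizer was trained on the 100M-word BabyLM corpus.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(
    ["the cat sat on the mat", "the dog ran after the cat"],
    trainer,
)

# Encode a sentence: frequent whole words end up as single tokens,
# unseen material falls back to smaller subword pieces.
encoding = tokenizer.encode("the cat ran")
print(encoding.tokens)
print(encoding.ids)
```

The same `encode`/`decode` interface applies to the pretrained tokenizer once loaded from the Hub.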
This tokenizer is derived from the BabyLM 100M dataset, a mixed-domain corpus consisting of the following sources: