Compact SentencePiece tokenizer for sample-efficient English language modeling, tokenizing plain natural language.
Load it as a `transformers` tokenizer:

```python
from transformers import AutoTokenizer

# Load the pretrained baby tokenizer from the Hugging Face Hub
tokenizer_baby = AutoTokenizer.from_pretrained("nilq/baby-tokenizer")
```

Or load it directly with the `tokenizers` library:

```python
from tokenizers import Tokenizer

# Same tokenizer, loaded via the lower-level tokenizers API
tokenizer_baby = Tokenizer.from_pretrained("nilq/baby-tokenizer")
```
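To illustrate the kind of subword tokenization involved without downloading the actual model, here is a minimal, self-contained sketch that trains a tiny BPE tokenizer on a toy corpus with the `tokenizers` library. The toy corpus, vocabulary size, and special tokens are illustrative choices, not properties of `nilq/baby-tokenizer` (which is a SentencePiece-style tokenizer trained on far more data):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical miniature analogue: a tiny BPE tokenizer trained on toy text.
# The real baby tokenizer was trained on the 100M-word BabyLM corpus.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(
    ["the cat sat on the mat", "the dog ran after the cat"],
    trainer,
)

# Encode a sentence: frequent whole words end up as single tokens,
# unseen material falls back to smaller subword pieces.
encoding = tokenizer.encode("the cat ran")
print(encoding.tokens)
print(encoding.ids)
```

The same `encode`/`decode` interface applies to the pretrained tokenizer once loaded from the Hub.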
This tokenizer is derived from the BabyLM 100M dataset, a mixed-domain corpus consisting of the following sources: