Tokenizers trained for From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes.
This repository contains the eight tokenizers trained for the project, covering the combinations of the three transformations:
- Character-level tokenization (CHAR) vs. subword tokenization (BPE)
- Phonemic data (PHON) vs. orthographic data (TXT)
- Removes whitespace (SPACELESS) vs. keeps whitespace

To load a tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('phonemetransformers/babble-tokenizers', subfolder='BABYLM-TOKENIZER-CHAR-TXT')
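
As a rough sketch, the eight combinations can be enumerated programmatically. The naming pattern here is an assumption extrapolated from the single example above (`BABYLM-TOKENIZER-CHAR-TXT`), with `-SPACELESS` appended for the whitespace-free variants; the helper `tokenizer_subfolders` is hypothetical, so check the repository's file listing for the actual subfolder names before relying on it.

```python
from itertools import product


def tokenizer_subfolders(prefix="BABYLM-TOKENIZER"):
    """Build candidate subfolder names for the 2 x 2 x 2 tokenizer variants.

    Assumption: names follow PREFIX-<CHAR|BPE>-<PHON|TXT>[-SPACELESS],
    inferred from the one example in this README, not confirmed by it.
    """
    names = []
    for unit, data, spacing in product(("CHAR", "BPE"), ("PHON", "TXT"), ("", "SPACELESS")):
        parts = [prefix, unit, data]
        if spacing:  # the whitespace-keeping variant is assumed to have no suffix
            parts.append(spacing)
        names.append("-".join(parts))
    return names


subfolders = tokenizer_subfolders()
for name in subfolders:
    print(name)
```

Any of these names could then be passed as `subfolder=` to `AutoTokenizer.from_pretrained`, provided the corresponding folder exists in the repository.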