---
license: mit
language:
- en
tags:
- babylm
- tokenizer
datasets:
- nilq/babylm-100M
---

## Baby Tokenizer

Compact SentencePiece tokenizer for sample-efficient English language modeling, tokenizing plain natural language.

### Usage

#### Transformers

```py
from transformers import AutoTokenizer

tokenizer_baby = AutoTokenizer.from_pretrained("nilq/baby-tokenizer")
```

#### Tokenizers

```py
from tokenizers import Tokenizer

tokenizer_baby = Tokenizer.from_pretrained("nilq/baby-tokenizer")
```

### Data

This tokenizer is derived from the BabyLM 100M dataset of mixed-domain data, consisting of the following sources:

- CHILDES (child-directed speech)
- Subtitles (speech)
- BNC (speech)
- TED talks (speech)
- children's books (simple written language)

### Specifications

- Vocabulary size: 20k
- Alphabet limit: 150
- Minimum token frequency: 100
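The exact training script is not published; as a sketch, a comparable tokenizer could be trained with the `tokenizers` library's `SentencePieceBPETokenizer` using the specifications above. The corpus file and its contents here are placeholders standing in for the BabyLM 100M data.

```python
from pathlib import Path
from tokenizers import SentencePieceBPETokenizer

# Tiny stand-in corpus; the actual tokenizer was trained on BabyLM 100M.
Path("corpus.txt").write_text("the cat sat on the mat .\n" * 500)

tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=20_000,   # vocabulary size: 20k (a toy corpus yields far fewer)
    limit_alphabet=150,  # alphabet limit: 150
    min_frequency=100,   # minimum token frequency: 100
)

print(tokenizer.encode("the cat sat on the mat").tokens)
```

With a real corpus, the resulting tokenizer can be saved with `tokenizer.save("tokenizer.json")` and loaded through either of the usage paths shown above.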