---
datasets:
- phonemetransformers/IPA-CHILDES
language:
- en
- eu
- zh
- da
- nl
- hr
- es
- et
- fa
- fr
- de
- hu
- is
- id
- ga
- it
- ja
- ko
- pt
- pl
- qu
- ro
- sr
- sv
- tr
- cy
- 'no'
---

# IPA CHILDES Models: Tiny

Phoneme-based GPT-2 models trained on all 31 sections of the [IPA-CHILDES](https://huggingface.co/datasets/phonemetransformers/IPA-CHILDES) dataset for the paper [BabyLM's First Words: Word Segmentation as a Phonological Probing Task](https://arxiv.org/abs/2504.03338).

The models have 600k non-embedding parameters and were trained on 100k tokens of their language. They were evaluated for phonological knowledge using the *word segmentation* task. See the paper for more details. Training and analysis scripts can be found [here](https://github.com/codebyzeb/PhonemeTransformers).

To load a model:

```python
from transformers import AutoModel

farsi_model = AutoModel.from_pretrained('phonemetransformers/ipa-childes-models-tiny', subfolder='Farsi')
```