---
language:
- en
---

**lexdec-medium-bpe** is a small, autoregressive Llama model with subword (BPE) tokenization, trained on the 2024/2025 [BabyLM dataset](https://osf.io/ryjfm/). The *checkpoints* branch contains 19 checkpoints: 10 across the first 10% of pretraining and 9 more across the remaining 90%.
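
The released model can be loaded with the `transformers` library; below is a minimal sketch. The repo and branch names come from this card, but how the 19 checkpoints are laid out on the *checkpoints* branch (e.g. as separate commits) is an assumption you should verify in the repo.

```python
# Minimal loading sketch for lexdec-medium-bpe with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bbunzeck/lexdec-medium-bpe"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # final model, main branch

# Intermediate checkpoints live on the "checkpoints" branch; any branch, tag,
# or commit hash can be passed as `revision`, e.g.:
# model = AutoModelForCausalLM.from_pretrained(model_id, revision="checkpoints")
```
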
We used this model to trace the development of linguistic knowledge (word-level and syntactic) across pretraining, and to compare it to its character-level counterparts and to smaller and larger subword models:
| | [small-char](https://huggingface.co/bbunzeck/lexdec-small-char) | [medium-char](https://huggingface.co/bbunzeck/lexdec-medium-char) | [large-char](https://huggingface.co/bbunzeck/lexdec-large-char) | [small-bpe](https://huggingface.co/bbunzeck/lexdec-small-bpe) | [medium-bpe](https://huggingface.co/bbunzeck/lexdec-medium-bpe) | [large-bpe](https://huggingface.co/bbunzeck/lexdec-large-bpe) |
|---|---:|---:|---:|---:|---:|---:|
| Embedding size | 128 | 256 | 512 | 128 | 256 | 512 |
| Hidden size | 128 | 256 | 512 | 128 | 256 | 512 |
| Layers | 4 | 8 | 12 | 4 | 8 | 12 |
| Attention heads | 4 | 8 | 12 | 4 | 8 | 12 |
| Context size | 128 | 128 | 128 | 128 | 128 | 128 |
| Vocab. size | 102 | 102 | 102 | 8,002 | 8,002 | 8,002 |
| Parameters | 486,016 | 3,726,592 | 21,940,736 | 2,508,416 | 7,771,392 | 30,030,336 |
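
The parameter counts in the table are reproduced exactly by a standard Llama architecture with untied input/output embeddings and an MLP intermediate size equal to the hidden size. Below is a hypothetical `LlamaConfig` for the medium-bpe column; those two inferred values are marked as assumptions in the comments.

```python
# Hypothetical config for the medium-bpe column of the table above.
# intermediate_size and tie_word_embeddings are inferred from the parameter
# count, not stated on this card; verify against the repo's config.json.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=8002,              # vocab. size
    hidden_size=256,              # embedding/hidden size
    intermediate_size=256,        # assumption: reproduces the parameter count
    num_hidden_layers=8,          # layers
    num_attention_heads=8,        # attention heads
    max_position_embeddings=128,  # context size
    tie_word_embeddings=False,    # assumption: separate in/out embeddings
)
model = LlamaForCausalLM(config)
print(model.num_parameters())  # 7,771,392, matching the table
```
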
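
The preprint cited below argues that surprisal can mask deficits in word learning, so per-token surprisal is a natural quantity to extract from these checkpoints. The following is an illustrative sketch, not the paper's evaluation pipeline.

```python
# Illustrative per-token surprisal (in bits); not the paper's exact pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bbunzeck/lexdec-medium-bpe"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

enc = tokenizer("The child found the ball.", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits[0]  # (seq_len, vocab_size)

# Surprisal of token t is -log2 p(token_t | tokens_<t): shift by one position.
log_probs = torch.log_softmax(logits[:-1], dim=-1)
targets = enc["input_ids"][0, 1:]
nats = -log_probs[torch.arange(targets.numel()), targets]
bits = nats / torch.log(torch.tensor(2.0))

for tok, s in zip(tokenizer.convert_ids_to_tokens(targets.tolist()), bits):
    print(f"{tok}\t{s.item():.2f}")
```
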
If you use this model, please cite the following preprint (the final version will be added as soon as it is published):
```
@misc{bunzeck2025subwordmodelsstruggleword,
      title={Subword models struggle with word learning, but surprisal hides it},
      author={Bastian Bunzeck and Sina Zarrieß},
      year={2025},
      eprint={2502.12835},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.12835},
}
```