BabyLM 2026 — Multilingual GPT-2 (MorPiece-16K)

Track: BabyLM 2026 Multilingual · Architecture: GPT-2 · Tokenizer: MorPiece 16K (multilingual)

A single, shared-weight GPT-2 language model trained on a balanced trilingual corpus of English, Dutch, and Chinese (100 M byte-premium-adjusted words) for the BabyLM 2026 Challenge multilingual track. The model never receives an explicit language identifier: language identity is implicit in the shared multilingual vocabulary and the model's learned representations.


Model Details

Architecture

Hyperparameter               Value
Architecture                 GPT-2 (GPT2LMHeadModel)
Hidden size (n_embd)         768
Layers (n_layer)             12
Attention heads (n_head)     12
Context length (seq_length)  512
Dropout                      0.1
Tied embeddings              Yes
Parameters                   ~117 M

Tokenizer

The model uses MorPiece (v1.4+), a morphologically aware, split-based tokenizer. For this model it is run with the --boundary-discovery option to handle Chinese (zho), which is written without spaces between words. Tokenizer training first re-orders sentences by length, relying only on strong punctuation and line endings to delimit them. Each split decision is based on Yang's Sufficiency Principle. The vocabulary (MoP_16K_multilingual) contains 16 000 tokens trained jointly on all three languages.

  • Repository: cristianochesi/morpiece
  • Vocabulary size: 16 000
  • Special tokens: <s> (BOS, id=1), </s> (EOS, id=2), <unk> (id=?), <pad> (id=3), <mask>

Training Data

A curated, cleaned multilingual corpus of English (eng), Dutch (nld), and Chinese (zho), totalling approximately 100 M byte-premium-adjusted (English-equivalent) words. Languages are sampled with weights proportional to their byte premiums (BP: eng=1.000, nld=1.052, zho=0.936) to balance information-content exposure across languages.

Language       Byte premium  Sampling weight
English (eng)  1.000         1.000
Dutch (nld)    1.052         1.052
Chinese (zho)  0.936         0.936
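
As a rough illustration of what these weights imply, the sketch below normalises them into per-language sampling probabilities. The names BYTE_PREMIUMS and sampling_probabilities are hypothetical and are not taken from the training code; the actual sampler may apply the weights differently.

```python
# Byte premiums from the table above, used directly as sampling weights.
BYTE_PREMIUMS = {"eng": 1.000, "nld": 1.052, "zho": 0.936}


def sampling_probabilities(weights: dict[str, float]) -> dict[str, float]:
    """Normalise non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


probs = sampling_probabilities(BYTE_PREMIUMS)
for lang, p in probs.items():
    print(f"{lang}: {p:.3f}")
```

Under this normalisation the three languages are drawn with probabilities of roughly 0.335 (eng), 0.352 (nld), and 0.313 (zho).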

The byte-premium adjustment follows Arnett, Chang & Bergen (SIGUL 2024): English-equivalent content = raw UTF-8 bytes ÷ byte premium, ensuring that the budget milestones (checkpoint_<N>M_words) correspond to the BabyLM multilingual track's denomination.
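
The adjustment can be sketched in a few lines. This is a minimal illustration of the formula above, not the preprocessing script itself; the function name english_equivalent_bytes is hypothetical. Converting the result from bytes to words needs a bytes-per-word constant that is not stated in this card, so the sketch stops at English-equivalent bytes.

```python
# Byte premiums (BP) as listed in the Training Data section.
BYTE_PREMIUMS = {"eng": 1.000, "nld": 1.052, "zho": 0.936}


def english_equivalent_bytes(text: str, lang: str) -> float:
    """Return the UTF-8 size of `text` divided by the language's byte premium."""
    raw_bytes = len(text.encode("utf-8"))
    return raw_bytes / BYTE_PREMIUMS[lang]


samples = [
    ("eng", "The child looked at"),
    ("nld", "Het kind keek naar"),
    ("zho", "孩子看着"),
]
for lang, sample in samples:
    print(lang, round(english_equivalent_bytes(sample, lang), 2))
```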

Preprocessing scripts: cristianochesi/babylm-2026 — 01-preprocess

Training Procedure

Hyperparameter               Value
Regimen                      baseline (non-overlapping windows)
Batch size                   16 sequences
Gradient accumulation steps  4 (effective batch = 64 seq × 512 tok = 32 768 tokens/step)
Peak learning rate           3 × 10⁻⁴
Minimum learning rate        3 × 10⁻⁵
LR schedule                  Cosine decay with linear warmup
Warmup                       1% of total optimizer steps
Weight decay                 0.1
β₁ / β₂                      0.9 / 0.999
Gradient clipping            1.0
AMP precision                bfloat16
Epochs                       10 passes over each language corpus
Budget milestones            BabyLM schedule up to 1 000 M words
Optimizer                    AdamW (fused when available)
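
The learning-rate schedule in the table can be sketched as follows. This is one common reading of "cosine decay with linear warmup", written from the listed values only. The function name lr_at, the warmup starting near zero, and the decay ending exactly at the last step are assumptions; train_multilingual.py may differ in these details.

```python
import math

PEAK_LR = 3e-4          # peak learning rate from the table
MIN_LR = 3e-5           # minimum learning rate from the table
WARMUP_FRACTION = 0.01  # warmup is 1% of total optimizer steps


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper)."""
    warmup_steps = max(1, int(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        # Linear warmup: reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / warmup_steps
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


# Example with a made-up run length of 1 000 optimizer steps.
for step in (0, 9, 10, 500, 1000):
    print(step, f"{lr_at(step, 1000):.2e}")
```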

Intermediate checkpoints are saved at BabyLM standard word-budget milestones (1, 2, 3 … 10, 20, 30 … 100, 200 … 1 000 M English-equivalent words) as checkpoint_<N>M_words/ directories, each loadable directly with AutoModelForCausalLM.from_pretrained.
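
The milestone sequence can be enumerated with a short helper. This is a sketch based only on the schedule and directory pattern described above; the function name babylm_milestones is hypothetical, and the directory names are assumed to follow checkpoint_<N>M_words exactly.

```python
def babylm_milestones(max_millions: int = 1000) -> list[int]:
    """Return 1..10, then 20..100 in tens, then 200..1000 in hundreds."""
    milestones = []
    step = 1
    value = 1
    while value <= max_millions:
        milestones.append(value)
        # After reaching 10, 100, ... switch to the next order of magnitude.
        if value == step * 10:
            step *= 10
        value += step
    return milestones


checkpoint_dirs = [f"checkpoint_{m}M_words" for m in babylm_milestones()]
print(len(checkpoint_dirs))
print(checkpoint_dirs[:3], checkpoint_dirs[-1])
```

With the default limit of 1 000 M words this yields 28 milestones.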

Hardware & Software

  • Framework: PyTorch 2.9.1 + Hugging Face Transformers
  • CUDA 12.8, conda environment env_py3_12_torch2_91_CUDA_12_8
  • Trainer: train_multilingual.py (cristianochesi/babylm-2026)

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("NeTS-IUSSPavia/babylm2026-ml-gpt2-mop16k")
model = AutoModelForCausalLM.from_pretrained("NeTS-IUSSPavia/babylm2026-ml-gpt2-mop16k")

# English
prompt = "The child looked at"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0]))

# Dutch
prompt_nl = "Het kind keek naar"
inputs = tokenizer(prompt_nl, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0]))

# Chinese
prompt_zh = "孩子看着"
inputs = tokenizer(prompt_zh, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0]))

No language identifier is needed. The model infers the language from the input sequence.


Evaluation

This model is evaluated under the BabyLM 2026 multilingual track pipeline. Standard evaluation tasks include:

  • BLiMP / BLiMP-NL / BLiMP-ZH — syntactic minimal-pair acceptability
  • (Super)GLUE / GLUE-NL — downstream NLU benchmarks
  • Perplexity on held-out multilingual test sets

Results will be updated here upon completion of the shared task evaluation.
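
For orientation, the two scoring ideas behind the tasks listed above can be sketched independently of any model. This is not the official BabyLM evaluation pipeline: the helper names are hypothetical, the sentence scores are made-up numbers standing in for model log-probabilities, and the real pipeline may normalise or aggregate differently.

```python
import math
from typing import Callable, Iterable


def minimal_pair_accuracy(
    pairs: Iterable[tuple[str, str]],
    log_prob: Callable[[str], float],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs where the acceptable
    sentence receives the higher log-probability."""
    pairs = list(pairs)
    correct = sum(log_prob(good) > log_prob(bad) for good, bad in pairs)
    return correct / len(pairs)


def perplexity(token_log_probs: Iterable[float]) -> float:
    """Exponential of the mean negative log-probability per token."""
    lps = list(token_log_probs)
    return math.exp(-sum(lps) / len(lps))


# Toy scores (invented for illustration, not produced by this model).
toy_scores = {
    "The cats sleep.": -10.0,
    "The cats sleeps.": -12.5,
    "She has left.": -9.0,
    "She have left.": -8.0,
}
pairs = [
    ("The cats sleep.", "The cats sleeps."),
    ("She has left.", "She have left."),
]
print(minimal_pair_accuracy(pairs, toy_scores.__getitem__))
print(perplexity([math.log(0.25)] * 4))
```

In practice, log_prob would sum the model's token log-probabilities for a sentence.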


Limitations

  • The model is trained on a small, child-scale corpus (≤100 M words per language) and is not intended for production NLP applications.
  • Performance on low-frequency phenomena will be limited relative to large-scale LMs.
  • No explicit language control is available; mixing languages within a single prompt may produce unpredictable continuations.
  • Chinese output quality may differ from English/Dutch due to the lower byte premium and the shared MorPiece tokenizer's segmentation behaviour for scriptio continua.

Citation

If you use this model or the associated training code, please cite:

@misc{chesi2026babylm,
  author       = {Chesi, Cristiano and {NeTS Lab}},
  title        = {{BabyLM 2026 Multilingual GPT-2 (MorPiece-16K)}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/NeTS-IUSSPavia/babylm2026-ml-gpt2-mop16k}},
  note         = {Submission to the BabyLM 2026 Challenge, Multilingual Track. IUSS Pavia -- NeTS Lab.}
}

Please also cite the MorPiece tokenizer and the BabyLM shared task:

@misc{chesi2024morpiece,
  author       = {Chesi, Cristiano and {NeTS Lab @ IUSS}},
  title        = {{MorPiece: A Morphologically-Aware Tokenizer Based on Yang's Tolerance Principle}},
  year         = {2024},
  howpublished = {\url{https://github.com/cristianochesi/morpiece}}
}

Model Card Contact

Cristiano Chesi — NeTS Lab, IUSS Pavia nets.iusspavia.it
