BabyLM 2026 — Multilingual GPT-2 (MorPiece-16K)
Track: BabyLM 2026 Multilingual · Architecture: GPT-2 · Tokenizer: MorPiece 16K (multilingual)
A single, shared-weight GPT-2 language model trained on a balanced trilingual corpus of English, Dutch, and Chinese (100 M byte-premium-adjusted words) for the BabyLM 2026 Challenge multilingual track. The model never receives an explicit language identifier: language identity is implicit in the shared multilingual vocabulary and the model's learned representations.
Model Details
Architecture
| Hyperparameter | Value |
|---|---|
| Architecture | GPT-2 (GPT2LMHeadModel) |
| Hidden size (n_embd) | 768 |
| Layers (n_layer) | 12 |
| Attention heads (n_head) | 12 |
| Context length (seq_length) | 512 |
| Dropout | 0.1 |
| Tied embeddings | ✓ |
| Parameters | ~117 M |
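As a rough cross-check of the table, the parameter count of a GPT-2 style decoder can be estimated from its hyperparameters. The sketch below is not taken from the training code; it assumes the standard GPT-2 layout (4× MLP width, biases on all linear layers, LayerNorm with weight and bias), a vocabulary of exactly 16 000 entries, tied embeddings, and 512 learned positions. Under these assumptions the estimate comes out near 98 M, which is below the ~117 M listed above; the released configuration may differ from these assumptions (for comparison, the same formula with GPT-2 small's original 50 257-token vocabulary and 1 024 positions gives about 124 M).

```python
def gpt2_param_count(vocab_size: int, n_positions: int, n_embd: int, n_layer: int) -> int:
    """Estimate parameters of a GPT-2 style decoder with tied embeddings."""
    d = n_embd
    token_emb = vocab_size * d                    # shared with the LM head (tied)
    pos_emb = n_positions * d                     # learned position embeddings
    attn = (d * 3 * d + 3 * d) + (d * d + d)      # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # up-projection + down-projection
    norms = 2 * (2 * d)                           # two LayerNorms per block (weight + bias)
    per_layer = attn + mlp + norms
    final_norm = 2 * d
    return token_emb + pos_emb + n_layer * per_layer + final_norm


# Values from the table; n_positions=512 is an assumption based on the context length.
total = gpt2_param_count(vocab_size=16_000, n_positions=512, n_embd=768, n_layer=12)
print(f"{total:,} parameters (~{total / 1e6:.1f} M)")
```

The number of attention heads does not appear in the formula because splitting the hidden size across heads does not change the size of the projection matrices.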
Tokenizer
The model uses MorPiece (v1.4+), a morphologically aware, split-based tokenizer. For this model it is run with the --boundary-discovery option (to handle Chinese, zho); tokenizer training begins by re-ordering sentences by length and relies only on strong punctuation and line endings. Each split is based on Yang's Sufficiency Principle. The vocabulary (MoP_16K_multilingual) contains 16 000 tokens jointly trained on all three languages.
- Repository: cristianochesi/morpiece
- Vocabulary size: 16 000
- Special tokens: <s> (BOS, id=1), </s> (EOS, id=2), <unk> (id=?), <pad> (id=3), <mask>
Training Data
A curated, cleaned multilingual corpus of English (eng), Dutch (nld), and Chinese (zho), totalling approximately 100 M byte-premium-adjusted (English-equivalent) words. Languages are sampled with weights proportional to their byte premiums (BP: eng=1.000, nld=1.052, zho=0.936) to balance information-content exposure across languages.
| Language | Byte Premium | Sampling weight |
|---|---|---|
| English (eng) | 1.000 | 1.000 |
| Dutch (nld) | 1.052 | 1.052 |
| Chinese (zho) | 0.936 | 0.936 |
The byte-premium adjustment follows Arnett, Chang & Bergen (SIGUL 2024): English-equivalent content = raw UTF-8 bytes ÷ byte premium, ensuring that the budget milestones (checkpoint_<N>M_words) correspond to the BabyLM multilingual track's denomination.
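The adjustment formula above can be illustrated with a few lines of Python. This is a minimal sketch, not the project's preprocessing code: the function name `english_equivalent_bytes` and the sample strings are chosen here for illustration, and the further step that turns English-equivalent bytes into a word count is not specified in this card, so it is left out.

```python
# Byte premiums as listed in the table above.
BYTE_PREMIUM = {"eng": 1.000, "nld": 1.052, "zho": 0.936}


def english_equivalent_bytes(text: str, lang: str) -> float:
    """Raw UTF-8 byte count divided by the language's byte premium."""
    raw_bytes = len(text.encode("utf-8"))
    return raw_bytes / BYTE_PREMIUM[lang]


# Illustrative strings only; any text in the three languages works.
samples = {
    "eng": "The child looked at",
    "nld": "Het kind keek naar",
    "zho": "孩子看着",
}
for lang, text in samples.items():
    raw = len(text.encode("utf-8"))
    print(lang, raw, round(english_equivalent_bytes(text, lang), 2))
```

A premium above 1 (Dutch) shrinks the raw byte count, while a premium below 1 (Chinese) inflates it, so the same budget covers comparable content in each language.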
Preprocessing scripts: cristianochesi/babylm-2026 — 01-preprocess
Training Procedure
| Hyperparameter | Value |
|---|---|
| Regimen | baseline (non-overlapping windows) |
| Batch size | 16 sequences |
| Gradient accumulation steps | 4 (effective batch = 64 seq × 512 tok = 32 768 tokens/step) |
| Peak learning rate | 3 × 10⁻⁴ |
| Minimum learning rate | 3 × 10⁻⁵ |
| LR schedule | Cosine decay with linear warmup |
| Warmup | 1% of total optimizer steps |
| Weight decay | 0.1 |
| β₁ / β₂ | 0.9 / 0.999 |
| Gradient clipping | 1.0 |
| AMP precision | bfloat16 |
| Epochs | 10 passes over each language corpus |
| Budget milestones | BabyLM schedule up to 1 000 M words |
| Optimizer | AdamW (fused when available) |
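The effective batch size and the learning-rate schedule in the table can be sketched as follows. This is an illustrative reconstruction, not the code in train_multilingual.py: the function name `lr_at`, the convention that warmup rises linearly to the peak, and the choice to decay over all remaining steps are assumptions.

```python
import math

# Values from the table above.
PEAK_LR = 3e-4
MIN_LR = 3e-5
WARMUP_FRACTION = 0.01
BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 4
SEQ_LENGTH = 512

# 16 sequences x 4 accumulation steps x 512 tokens = 32 768 tokens per optimizer step.
TOKENS_PER_STEP = BATCH_SIZE * GRAD_ACCUM_STEPS * SEQ_LENGTH


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = max(1, int(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches PEAK_LR on the last warmup step (assumed convention).
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


total_steps = 10_000  # hypothetical run length, for illustration only
for step in (0, 99, 100, 5_000, 10_000):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

With these conventions the rate peaks at the end of warmup and ends at the minimum learning rate on the final step.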
Intermediate checkpoints are saved at BabyLM standard word-budget milestones (1, 2, 3 … 10, 20, 30 … 100, 200 … 1 000 M English-equivalent words) as checkpoint_<N>M_words/ directories, each loadable directly with AutoModelForCausalLM.from_pretrained.
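The milestone pattern described above can be enumerated programmatically. The sketch below assumes the schedule is exactly 1 to 9 M in steps of 1, 10 to 90 M in steps of 10, and 100 to 1 000 M in steps of 100; the helper names `babylm_milestones` and `checkpoint_dir` are hypothetical and not part of the training code.

```python
def babylm_milestones(max_millions: int = 1000) -> list[int]:
    """Word-budget milestones in millions of English-equivalent words."""
    milestones = (
        list(range(1, 10))                        # 1, 2, ..., 9
        + list(range(10, 100, 10))                # 10, 20, ..., 90
        + list(range(100, max_millions + 1, 100)) # 100, 200, ..., max
    )
    return [m for m in milestones if m <= max_millions]


def checkpoint_dir(millions: int) -> str:
    """Directory name following the checkpoint_<N>M_words pattern."""
    return f"checkpoint_{millions}M_words"


names = [checkpoint_dir(m) for m in babylm_milestones()]
print(len(names), names[0], names[-1])
```

Each resulting name can then be passed, as a path to the corresponding directory, to AutoModelForCausalLM.from_pretrained.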
Hardware & Software
- Framework: PyTorch 2.9.1 + HuggingFace Transformers
- CUDA 12.8, conda environment env_py3_12_torch2_91_CUDA_12_8
- Trainer: train_multilingual.py (cristianochesi/babylm-2026)
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hub repository ID under which this model card is published
repo_id = "NeTS-lab/babylm26_multiling_gpt2_MoP16K_baseline"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# English
prompt = "The child looked at"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0]))

# Dutch
prompt_nl = "Het kind keek naar"
inputs = tokenizer(prompt_nl, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0]))

# Chinese
prompt_zh = "孩子看着"
inputs = tokenizer(prompt_zh, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0]))
No language identifier is needed. The model infers the language from the input sequence.
Evaluation
This model is evaluated under the BabyLM 2026 multilingual track pipeline. Standard evaluation tasks include:
- BLiMP / BLiMP-NL / BLiMP-ZH — syntactic minimal-pair acceptability
- (Super)GLUE / GLUE-NL — downstream NLU benchmarks
- Perplexity on held-out multilingual test sets
Results will be updated here upon completion of the shared task evaluation.
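For orientation, the minimal-pair and perplexity tasks listed above both reduce to simple operations on per-token log-probabilities. The sketch below is not the BabyLM evaluation pipeline, which may normalize scores differently; the function names are hypothetical and the log-probabilities are made-up numbers, standing in for values that would come from the model.

```python
import math
from typing import Sequence


def sentence_logprob(token_logprobs: Sequence[float]) -> float:
    """Total log-probability of a sentence given per-token log-probabilities."""
    return sum(token_logprobs)


def prefers_grammatical(good: Sequence[float], bad: Sequence[float]) -> bool:
    """A minimal pair counts as correct if the acceptable sentence scores higher."""
    return sentence_logprob(good) > sentence_logprob(bad)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Exponential of the negative mean per-token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up per-token log-probabilities, purely for illustration.
good = [-2.1, -0.7, -1.3, -0.9]
bad = [-2.1, -0.7, -3.8, -1.6]
print(prefers_grammatical(good, bad))
print(round(perplexity(good), 3))
```

Accuracy on a minimal-pair benchmark is then the fraction of pairs for which the acceptable sentence receives the higher score.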
Limitations
- The model is trained on a small, child-scale corpus (≤100 M words per language) and is not intended for production NLP applications.
- Performance on low-frequency phenomena will be limited relative to large-scale LMs.
- No explicit language control is available; mixing languages within a single prompt may produce unpredictable continuations.
- Chinese output quality may differ from English/Dutch due to the lower byte premium and the shared tokenizer's segmentation behaviour for scriptio continua.
Citation
If you use this model or the associated training code, please cite:
@misc{chesi2026babylm,
author = {Chesi, Cristiano and {NeTS Lab}},
title = {{BabyLM 2026 Multilingual GPT-2 (MorPiece-16K)}},
year = {2026},
howpublished = {\url{https://huggingface.co/NeTS-IUSSPavia/babylm2026-ml-gpt2-mop16k}},
note = {Submission to the BabyLM 2026 Challenge, Multilingual Track. IUSS Pavia -- NeTS Lab.}
}
Please also cite the MorPiece tokenizer and the BabyLM shared task:
@misc{chesi2024morpiece,
author = {Chesi, Cristiano and {NeTS Lab @ IUSS}},
title = {{MorPiece: A Morphologically-Aware Tokenizer Based on Yang's Tolerance Principle}},
year = {2024},
howpublished = {\url{https://github.com/cristianochesi/morpiece}}
}
Model Card Contact
Cristiano Chesi — NeTS Lab, IUSS Pavia nets.iusspavia.it