BabyLM 2026 — Multilingual GPT-2 (MorPiece-16K)

Track: BabyLM 2026 Multilingual · Architecture: GPT-2 · Tokenizer: MorPiece 16K (multilingual)

A single, shared-weight GPT-2 language model trained on a balanced trilingual corpus of English, Dutch, and Chinese (100 M byte-premium-adjusted words) for the BabyLM 2026 Challenge multilingual track. The model never receives an explicit language identifier: language identity is implicit in the shared multilingual vocabulary and the model's learned representations.

Model Details

Architecture

| Hyperparameter | Value |
|---|---|
| Architecture | GPT-2 (`GPT2LMHeadModel`) |
| Hidden size (`n_embd`) | 768 |
| Layers (`n_layer`) | 12 |
| Attention heads (`n_head`) | 12 |
| Context length (`seq_length`) | 512 |
| Dropout | 0.1 |
| Tied embeddings | ✓ |
| Parameters | ~117 M |
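
As a rough cross-check of the table, the parameter count can be estimated from the listed hyperparameters. This is a minimal sketch, not a readout of the checkpoint: `gpt2_param_count` is a hypothetical helper, and it assumes the default GPT-2 layout (MLP width of 4 × `n_embd`, learned position embeddings, biases on all projections) and a vocabulary of exactly 16 000 entries. Under these assumptions the estimate comes out near 98 M, so the figure in the table may have been computed under a different configuration.

```python
def gpt2_param_count(n_embd, n_layer, vocab_size, n_positions, tied=True):
    """Estimate parameters for a standard GPT-2 layout (assumed, not read from the checkpoint)."""
    d = n_embd
    inner = 4 * d  # assumption: default GPT-2 MLP width
    attn = (d * 3 * d + 3 * d) + (d * d + d)       # fused QKV projection + output projection
    mlp = (d * inner + inner) + (inner * d + d)    # up-projection + down-projection
    norms = 2 * (2 * d)                            # two LayerNorms per block, weight + bias each
    per_layer = attn + mlp + norms
    embeddings = vocab_size * d + n_positions * d  # token + learned position embeddings
    final_norm = 2 * d                             # final LayerNorm, weight + bias
    lm_head = 0 if tied else vocab_size * d        # tied embeddings add no output matrix
    return n_layer * per_layer + embeddings + final_norm + lm_head


total = gpt2_param_count(n_embd=768, n_layer=12, vocab_size=16_000, n_positions=512)
print(f"{total:,} parameters")
```

Most of the difference from the original GPT-2 small comes from the token embedding matrix, which shrinks with the 16 000-token vocabulary.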

Tokenizer

The model uses MorPiece (v1.4+), a morphologically aware, split-based tokenizer. For this model it is run with the --boundary-discovery option, which is needed to handle Chinese (zho). Tokenizer training starts by reordering sentences by length, relying only on strong punctuation and line endings. Each split is based on Yang's Sufficiency Principle. The vocabulary (MoP_16K_multilingual) contains 16 000 tokens trained jointly on all three languages.

Repository: cristianochesi/morpiece
Vocabulary size: 16 000
Special tokens: <s> (BOS, id=1), </s> (EOS, id=2), <unk> (id=?), <pad> (id=3), <mask>

Training Data

The training data is a curated, cleaned multilingual corpus of English (eng), Dutch (nld), and Chinese (zho), totalling approximately 100 M byte-premium-adjusted (English-equivalent) words. Languages are sampled with weights proportional to their byte premiums (BP: eng=1.000, nld=1.052, zho=0.936) to balance information-content exposure across languages.

| Language | Byte premium | Sampling weight |
|---|---|---|
| English (eng) | 1.000 | 1.000 |
| Dutch (nld) | 1.052 | 1.052 |
| Chinese (zho) | 0.936 | 0.936 |
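
The sampling weights in the table are relative, so a sampler would typically normalise them into probabilities. The sketch below shows that step only. The helper name `sampling_probabilities` and the per-draw sampling with `random.choices` are assumptions for illustration; the card does not say at what granularity (document, sequence, or batch) the trainer samples.

```python
import random

# Byte premiums from the table above, used directly as relative sampling weights.
BYTE_PREMIUM = {"eng": 1.000, "nld": 1.052, "zho": 0.936}


def sampling_probabilities(weights):
    """Normalise relative weights so they sum to 1."""
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


probs = sampling_probabilities(BYTE_PREMIUM)
for lang, p in probs.items():
    print(f"{lang}: {p:.4f}")

# Illustrative only: draw the language of the next 10 training items.
rng = random.Random(0)
draws = rng.choices(list(probs), weights=list(probs.values()), k=10)
print(draws)
```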

The byte-premium adjustment follows Arnett, Chang & Bergen (SIGUL 2024): English-equivalent content = raw UTF-8 bytes ÷ byte premium. This keeps the budget milestones (checkpoint_<N>M_words) in the English-equivalent units used by the BabyLM multilingual track.
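
The adjustment formula can be sketched as follows. The function name `english_equivalent_bytes` is hypothetical, and the sketch stops at English-equivalent bytes: the card does not state how bytes are converted into the word counts used for the milestones, so that step is left out rather than guessed.

```python
# Byte premiums as listed in the table above.
BYTE_PREMIUM = {"eng": 1.000, "nld": 1.052, "zho": 0.936}


def english_equivalent_bytes(text, lang):
    """Raw UTF-8 byte length divided by the language's byte premium."""
    raw_bytes = len(text.encode("utf-8"))
    return raw_bytes / BYTE_PREMIUM[lang]


samples = {
    "eng": "The child looked at",
    "nld": "Het kind keek naar",
    "zho": "孩子看着",
}
for lang, text in samples.items():
    raw = len(text.encode("utf-8"))
    print(f"{lang}: {raw} raw bytes, {english_equivalent_bytes(text, lang):.2f} English-equivalent bytes")
```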

Preprocessing scripts: cristianochesi/babylm-2026 — 01-preprocess

Training Procedure

| Hyperparameter | Value |
|---|---|
| Regimen | `baseline` (non-overlapping windows) |
| Batch size | 16 sequences |
| Gradient accumulation steps | 4 (effective batch = 64 seq × 512 tok = 32 768 tokens/step) |
| Peak learning rate | 3 × 10⁻⁴ |
| Minimum learning rate | 3 × 10⁻⁵ |
| LR schedule | Cosine decay with linear warmup |
| Warmup | 1% of total optimizer steps |
| Weight decay | 0.1 |
| β₁ / β₂ | 0.9 / 0.999 |
| Gradient clipping | 1.0 |
| AMP precision | bfloat16 |
| Epochs | 10 passes over each language corpus |
| Budget milestones | BabyLM schedule up to 1 000 M words |
| Optimizer | AdamW (fused when available) |
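
The effective batch size and the learning-rate schedule in the table can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions, not the code in train_multilingual.py: `lr_at` is a hypothetical helper, warmup is assumed to ramp linearly from near zero to the peak rate, and the cosine is assumed to decay from the peak to the minimum rate over the remaining steps.

```python
import math

BATCH_SIZE = 16      # sequences per forward pass
GRAD_ACCUM = 4       # gradient accumulation steps
SEQ_LENGTH = 512     # tokens per sequence
TOKENS_PER_STEP = BATCH_SIZE * GRAD_ACCUM * SEQ_LENGTH


def lr_at(step, total_steps, peak=3e-4, minimum=3e-5, warmup_frac=0.01):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to `minimum`."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches `peak` on the last warmup step.
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return minimum + 0.5 * (peak - minimum) * (1.0 + math.cos(math.pi * progress))


print(f"tokens per optimizer step: {TOKENS_PER_STEP}")
for step in (0, 99, 5_000, 10_000):
    print(step, f"{lr_at(step, total_steps=10_000):.2e}")
```

The total of 10 000 steps in the usage lines is an arbitrary illustration value; the actual step count depends on corpus size and is not given in the card.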

Intermediate checkpoints are saved at BabyLM standard word-budget milestones (1, 2, 3 … 10, 20, 30 … 100, 200 … 1 000 M English-equivalent words) as checkpoint_<N>M_words/ directories, each loadable directly with AutoModelForCausalLM.from_pretrained.
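
The milestone schedule described above can be enumerated programmatically, for example to iterate over all intermediate checkpoints. In this sketch the helper name `babylm_milestones` is hypothetical, and the directory names assume that `<N>` is written as a plain integer without separators (for example `checkpoint_1000M_words`), which the card does not spell out.

```python
def babylm_milestones():
    """Word-budget milestones in millions: 1..10, then 20..100, then 200..1000."""
    return (
        list(range(1, 11))
        + list(range(20, 101, 10))
        + list(range(200, 1001, 100))
    )


# Assumed naming: integer N with no thousands separator.
checkpoint_dirs = [f"checkpoint_{n}M_words" for n in babylm_milestones()]
print(len(checkpoint_dirs), "checkpoints")
print(checkpoint_dirs[:3], "...", checkpoint_dirs[-1])
```

When the directories live in the Hub repository, a name from this list could be passed to `from_pretrained` through its `subfolder` argument; whether the repository is organised that way is also an assumption.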

Hardware & Software

Framework: PyTorch 2.9.1 + HuggingFace Transformers
CUDA 12.8, conda environment env_py3_12_torch2_91_CUDA_12_8
Trainer: train_multilingual.py (cristianochesi/babylm-2026)

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("NeTS-lab/babylm26_multiling_gpt2_MoP16K_baseline")
model = AutoModelForCausalLM.from_pretrained("NeTS-lab/babylm26_multiling_gpt2_MoP16K_baseline")

# One prompt per training language; no language tag is prepended
prompts = {
    "English": "The child looked at",
    "Dutch": "Het kind keek naar",
    "Chinese": "孩子看着",
}

for language, prompt in prompts.items():
    # Tokenize the prompt and generate up to 30 new tokens
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=30)
    print(f"{language}: {tokenizer.decode(output[0])}")

No language identifier is needed. The model infers the language from the input sequence.

Evaluation

This model is evaluated under the BabyLM 2026 multilingual track pipeline. Standard evaluation tasks include:

BLiMP / BLiMP-NL / BLiMP-ZH — syntactic minimal-pair acceptability
(Super)GLUE / GLUE-NL — downstream NLU benchmarks
Perplexity on held-out multilingual test sets
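
For orientation, the two intrinsic metrics in the list can be sketched from per-token log-probabilities. This follows the common convention (a minimal pair counts as correct when the grammatical sentence receives the higher total log-probability; perplexity is the exponential of the mean negative log-likelihood) and is not taken from the BabyLM evaluation pipeline, whose exact scoring may differ. The function names and the toy log-probabilities are illustrative assumptions.

```python
import math


def sentence_logprob(token_logprobs):
    """Total log-probability of a sentence from its per-token log-probabilities."""
    return sum(token_logprobs)


def minimal_pair_accuracy(pairs):
    """Share of (grammatical, ungrammatical) pairs where the grammatical one scores higher."""
    correct = sum(sentence_logprob(good) > sentence_logprob(bad) for good, bad in pairs)
    return correct / len(pairs)


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (natural-log inputs)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy values standing in for model outputs.
toy_pairs = [
    ([-1.0, -2.0], [-1.0, -3.5]),   # grammatical sentence preferred
    ([-0.5, -0.7], [-0.4, -0.6]),   # ungrammatical sentence preferred
]
print("minimal-pair accuracy:", minimal_pair_accuracy(toy_pairs))
print("perplexity:", perplexity([math.log(0.25)] * 3))
```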

Results will be updated here upon completion of the shared task evaluation.

Limitations

The model is trained on a small, child-scale corpus (≤100 M words per language) and is not intended for production NLP applications.
Performance on low-frequency phenomena will be limited relative to large-scale LMs.
No explicit language control is available; mixing languages within a single prompt may produce unpredictable continuations.
Chinese output quality may differ from English/Dutch due to the lower byte premium and the shared MorPiece tokenizer's segmentation behaviour for scriptio continua.

Citation

If you use this model or the associated training code, please cite:

@misc{chesi2026babylm,
  author       = {Chesi, Cristiano and {NeTS Lab}},
  title        = {{BabyLM 2026 Multilingual GPT-2 (MorPiece-16K)}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/NeTS-IUSSPavia/babylm2026-ml-gpt2-mop16k}},
  note         = {Submission to the BabyLM 2026 Challenge, Multilingual Track. IUSS Pavia -- NeTS Lab.}
}

Please also cite the MorPiece tokenizer and the BabyLM shared task:

@misc{chesi2024morpiece,
  author       = {Chesi, Cristiano and {NeTS Lab @ IUSS}},
  title        = {{MorPiece: A Morphologically-Aware Tokenizer Based on Yang's Tolerance Principle}},
  year         = {2024},
  howpublished = {\url{https://github.com/cristianochesi/morpiece}}
}

Model Card Contact

Cristiano Chesi, NeTS Lab, IUSS Pavia (nets.iusspavia.it)
