babylm-zho-pinyin-code-97M

A GPT-style causal language model for the 2026 Chinese BabyLM Challenge, trained on a pinyin-code representation of Mandarin instead of Hanzi.

Text is first segmented into words with jieba, and every syllable is then encoded as a compact initial+digit token that preserves the pinyin initial, a tone group, and a syllable-length bucket (for example, 我们 → W6M7). The encoding is lossy and phonology-first: distinct syllables can share the same code.
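
To make the shape of such an encoding concrete, here is a minimal sketch. It is not the author's implementation, which lives in the linked repository. The tone grouping, the length buckets, and the way both are packed into a single digit are all assumptions, chosen only so that the 我们 → W6M7 example from the text comes out the same. The first letter of each syllable stands in for the pinyin initial, and syllables are passed in already romanized, so jieba segmentation and Hanzi-to-pinyin conversion are out of scope here.

```python
from typing import Iterable, Tuple

# ASSUMPTION: tones 1-2 form one group; tones 3-4 and the neutral tone (5) form another.
# ASSUMPTION: length buckets are letter counts 1, 2, 3, 4, and 5-or-more.
N_LENGTH_BUCKETS = 5


def encode_syllable(pinyin: str, tone: int) -> str:
    """Encode one toneless pinyin syllable and its tone number as letter+digit."""
    if not pinyin or not 1 <= tone <= 5:
        raise ValueError("expected a non-empty syllable and a tone in 1..5")
    initial = pinyin[0].upper()  # first letter as a stand-in for the initial
    tone_group = 0 if tone in (1, 2) else 1
    length_bucket = min(len(pinyin), N_LENGTH_BUCKETS) - 1  # 0..4
    digit = tone_group * N_LENGTH_BUCKETS + length_bucket   # 0..9
    return f"{initial}{digit}"


def encode_word(syllables: Iterable[Tuple[str, int]]) -> str:
    """Concatenate the codes of a word's syllables."""
    return "".join(encode_syllable(p, t) for p, t in syllables)


print(encode_word([("wo", 3), ("men", 5)]))                  # W6M7
print(encode_syllable("ma", 3), encode_syllable("mi", 4))    # M6 M6
```

The second line of output shows what "lossy" means in practice: under this hypothetical rule, two different syllables collapse to the same code, so the original text cannot be recovered from the codes alone.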

This is the larger, better-performing model (the 33.4M version is the small baseline).

  • Architecture: 12 layers · 12 heads · 768 hidden · 512 context · 16k SentencePiece-BPE vocab (~97.7M params)
  • Training data: BabyLM-zho (~73M pinyin-code tokens), 20 epochs
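
As a rough sanity check on the figures above, the parameter count can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes a standard GPT-2 layout (learned position embeddings, fused QKV projection with biases, a 4x MLP, two LayerNorms per block plus a final one, and an output layer tied to the token embeddings) and a vocabulary of exactly 16,000 entries. The model's architecture is custom, so these are assumptions rather than a description of the actual code; the function name is hypothetical.

```python
def gpt2_style_param_count(n_layer: int, d_model: int, n_ctx: int, vocab_size: int) -> int:
    """Count parameters of a GPT-2-style decoder with tied input/output embeddings."""
    # Token and learned position embeddings.
    embeddings = vocab_size * d_model + n_ctx * d_model
    # Attention: fused QKV projection and output projection, both with biases.
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: expand to 4*d_model, then project back, both with biases.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with a weight and a bias vector.
    block_norms = 2 * 2 * d_model
    final_norm = 2 * d_model
    return embeddings + n_layer * (attention + mlp + block_norms) + final_norm


total = gpt2_style_param_count(n_layer=12, d_model=768, n_ctx=512, vocab_size=16_000)
print(total)  # 97737216

# Tokens processed during training, taking "~73M tokens, 20 epochs" at face value.
tokens_seen = 73_000_000 * 20
print(tokens_seen)  # 1460000000
```

Under these assumptions the total is about 97.7M, in line with the stated size, and training covers roughly 1.46B token positions.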

Evaluation — chinese-babylm-eval-pipeline

Task                    Score    vs 33.4M
ZhoBLiMP (acc)          72.68    +3.9
Hanzi-structure (acc)   55.00    ~
Hanzi-pinyin (acc)      95.80    +0.4
AFQMC (acc)             68.54    ~
OCNLI (acc)             61.80    +4.2
TNEWS (acc)             52.26    +1.5
CLUEWSC (acc)           63.49    +1.0
word_fmri (corr)        0.554    ~
fmri (corr)             0.086    ~

For reference, the eval pipeline's Qwen3-0.6B baseline scores 71.67 on ZhoBLiMP — this 97.7M pinyin-code model reaches 72.68.
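
For readers unfamiliar with the metric, the sketch below shows how accuracy is typically computed on BLiMP-style minimal-pair benchmarks: a pair counts as correct when the model assigns a higher log-probability to the acceptable sentence than to the unacceptable one. Whether the eval pipeline scores ZhoBLiMP exactly this way (for example, summed versus length-normalized log-probabilities) is an assumption here, and the numbers in the example are invented for illustration, not model outputs.

```python
from typing import Sequence, Tuple


def minimal_pair_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Return accuracy in percent.

    Each pair is (logprob_acceptable, logprob_unacceptable) for one minimal pair.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for good, bad in pairs if good > bad)
    return 100.0 * correct / len(pairs)


# Invented log-probabilities for four hypothetical minimal pairs.
toy_pairs = [(-12.1, -14.8), (-20.3, -19.9), (-8.7, -9.2), (-15.0, -17.4)]
print(minimal_pair_accuracy(toy_pairs))  # 75.0
```

Chance level on such a benchmark is 50, since every item is a two-way choice.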

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "CPSPX/babylm-zho-pinyin-code-97M"
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)

# The tokenizer auto-converts raw Mandarin into the pinyin-code representation:
inputs = tokenizer("我们一起来看电影吧", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0]))

trust_remote_code=True is required — the architecture and the transliterating tokenizer are custom. Source: https://github.com/tbhrobrecht/babylm-pinyin-abbreviations

Weights: Safetensors · 97.7M params · F32