
babylm-zho-pinyin-code-33M

A compact GPT-style causal language model for the 2026 Chinese BabyLM Challenge, trained on a pinyin-code representation of Mandarin instead of Hanzi.

Text is segmented into words with jieba, and every syllable is encoded as a compact initial+digit token that preserves the pinyin initial, a tone group, and a syllable-length bucket (e.g. 我们 → W6M7). The encoding is lossy and phonology-first.
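To make the token shape concrete, here is a minimal sketch that splits a word code back into per-syllable (initial, digit) pairs. It goes only by the single example above: the pattern "one or more uppercase letters followed by one digit" is an assumption, and `split_code` is a hypothetical helper, not part of the released tokenizer. How the digit combines tone group and length bucket is not specified here, so the sketch does not try to interpret it.

```python
import re

# Assumed token shape, inferred from the single example W6M7:
# an uppercase initial (one or more letters) followed by exactly one digit.
SYLLABLE = re.compile(r"([A-Z]+)(\d)")


def split_code(word_code):
    """Split a word code into (initial, digit) pairs, one per syllable.

    Hypothetical helper for illustration only.
    """
    parts = SYLLABLE.findall(word_code)
    # Reject input that does not consist entirely of initial+digit tokens.
    if "".join(initial + digit for initial, digit in parts) != word_code:
        raise ValueError(f"not a sequence of initial+digit tokens: {word_code!r}")
    return parts


print(split_code("W6M7"))  # [('W', '6'), ('M', '7')]
```

Because only the initial and one digit survive per syllable, many distinct Hanzi words can map to the same code, which is what makes the representation lossy.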

  • Architecture: 8 layers · 8 heads · 512 hidden · 512 context · 16k SentencePiece-BPE vocab (~33.4M params)
  • Training data: BabyLM-zho (~73M pinyin-code tokens), 5 epochs
  • Bigger sibling: CPSPX/babylm-zho-pinyin-code-97M
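As a rough sanity check on the size quoted above, the parameter count can be estimated from the listed dimensions. This is a sketch under assumptions, not a description of the actual custom architecture: it assumes a GPT-2-style block (fused QKV plus output projection, a 4x MLP, two LayerNorms, biases everywhere), learned position embeddings, an output head tied to the token embeddings, and a vocabulary of exactly 16,000. The function name `gpt_param_count` is hypothetical. The number of heads does not change the count.

```python
def gpt_param_count(n_layer, d_model, vocab_size, n_ctx, tied_embeddings=True):
    """Estimate parameters of a GPT-2-style decoder (assumed layout)."""
    # Attention: QKV (3*d*d) + output projection (d*d), plus 4*d biases.
    attn = 4 * d_model * d_model + 4 * d_model
    # MLP: d -> 4d and 4d -> d weights, plus biases of size 4d and d.
    mlp = 8 * d_model * d_model + 5 * d_model
    # Two LayerNorms per block, each with a weight and a bias vector.
    norms = 2 * 2 * d_model
    per_layer = attn + mlp + norms

    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * d_model + n_ctx * d_model
    final_norm = 2 * d_model
    # An untied output head would add another vocab_size * d_model matrix.
    lm_head = 0 if tied_embeddings else vocab_size * d_model

    return n_layer * per_layer + embeddings + final_norm + lm_head


total = gpt_param_count(n_layer=8, d_model=512, vocab_size=16_000, n_ctx=512)
print(f"{total / 1e6:.2f}M")  # 33.67M
```

Under these assumptions the estimate lands in the same range as the ~33.4M figure; small differences would be expected from the exact vocabulary size, the position-encoding scheme, and whether biases are used.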

Evaluation — chinese-babylm-eval-pipeline

Task              Metric   Score
ZhoBLiMP          acc      68.79
Hanzi-structure   acc      55.10
Hanzi-pinyin      acc      95.35
AFQMC             acc      69.00
OCNLI             acc      57.63
TNEWS             acc      50.73
CLUEWSC           acc      62.50
word_fmri         corr     0.556
fmri              corr     0.089
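For readers unfamiliar with BLiMP-style benchmarks such as ZhoBLiMP, accuracy is typically the share of minimal pairs in which the model assigns a higher log-probability to the grammatical sentence than to the ungrammatical one. The sketch below illustrates that computation only; whether chinese-babylm-eval-pipeline scores pairs in exactly this way is an assumption, `minimal_pair_accuracy` is a hypothetical helper, and the log-probabilities are made-up toy values, not model outputs.

```python
def minimal_pair_accuracy(pairs):
    """Percentage of pairs where the grammatical sentence scores higher.

    `pairs` is a list of (logp_grammatical, logp_ungrammatical) tuples.
    """
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if good > bad)
    return 100.0 * correct / len(pairs)


# Toy sentence log-probabilities (invented for illustration).
toy_pairs = [(-12.3, -15.8), (-20.1, -19.4), (-8.7, -9.9), (-30.2, -33.0)]
acc = minimal_pair_accuracy(toy_pairs)
print(f"{acc:.2f}")  # 75.00
```

With two choices per pair, chance performance on such a metric is 50.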

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "CPSPX/babylm-zho-pinyin-code-33M"
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)

# The tokenizer auto-converts raw Mandarin into the pinyin-code representation:
inputs = tokenizer("我们一起来看电影吧", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0]))

trust_remote_code=True is required — the architecture and the transliterating tokenizer are custom. Source: https://github.com/tbhrobrecht/babylm-pinyin-abbreviations

Safetensors · Model size: 33.7M params · Tensor type: F32