Pan-Turkic BPE Tokenizer

A SentencePiece BPE tokenizer with 65,536 vocabulary size, purpose-built for the Turkic language family. Covers Latin, Cyrillic, and Arabic scripts used across Turkic languages.

Overview

Most existing tokenizers fail on Turkic languages outside of Turkish — particularly on Cyrillic-script languages like Kazakh, Kyrgyz, Bashkir, and Tatar, where they fall back to byte-level tokenization. This tokenizer was trained specifically on a pan-Turkic corpus covering 20+ languages, and handles all major scripts natively.

Languages with strong coverage:

Language	Script
Turkish	Latin
Kazakh	Cyrillic
Kyrgyz	Cyrillic
Uzbek	Latin
Uyghur	Arabic
Bashkir	Cyrillic
Tatar	Cyrillic
Azerbaijani	Latin
Crimean Tatar	Latin
Turkmen	Latin

FLORES-200 Fertility Benchmark

Fertility = average tokens per word (lower is better). Evaluated on 1,012 sentences per language from the FLORES-200 devtest set.

Language	Ours	Kumru-2B	GPT-2	mT5	NLLB-200	XLM-R
Turkish	1.78	1.59	3.79	2.16	2.00	1.83
Kazakh (Cyrl)	1.79	10.96	9.25	2.35	2.09	2.04
Kyrgyz (Cyrl)	1.73	11.18	8.95	2.57	2.21	2.07
Uzbek (Latn)	1.96	3.40	3.44	2.57	2.24	2.26
Uyghur (Arab)	1.72	9.42	10.92	4.91	2.45	2.46
Bashkir (Cyrl)	1.92	10.93	9.07	3.01	2.10	3.52
Tatar (Cyrl)	1.88	10.74	8.72	2.63	2.06	3.07
Azerbaijani (Latn)	1.72	3.34	4.92	2.40	2.16	1.86
Crimean Tatar (Latn)	2.19	2.61	3.75	2.49	2.14	2.36
Turkmen (Latn)	2.48	3.56	4.27	3.18	2.33	3.05
Turkic Avg (10 langs)	1.92	6.77	6.71	2.83	2.18	2.45
English	2.27	2.01	1.24	1.55	1.41	1.41

Vocab sizes: Ours 65,536 · Kumru-2B 50,176 · GPT-2 50,257 · mT5 250,100 · NLLB-200 256,204 · XLM-R 250,002

Key result: Best on 7 of 10 Turkic languages. Achieves similar Turkic coverage to NLLB-200 (256K vocab) with a 4× smaller vocabulary.

For Cyrillic-script Turkic languages (Kazakh, Kyrgyz, Bashkir, Tatar), competing tokenizers degrade to byte-level encoding (10–11 tokens/word). This tokenizer maintains ~1.8 tokens/word on the same languages.

Notable Examples

Morphologically complex Turkish words encode efficiently:

"Cumhurbaşkanlığı"      → 1 token   # (Presidency)
"yapamayacaklarından"   → 3 tokens  # (from those they cannot do)
"sağlıklaştırılamayabileceklerden" → 6 tokens

Perfect round-trip for all supported scripts:

encode → decode  # lossless for Latin, Cyrillic, and Arabic Turkic scripts

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ArinUmut/pan-turkic-tokenizer")

# Turkish
tokenizer.encode("Türkiye Cumhuriyeti")

# Kazakh (Cyrillic)
tokenizer.encode("Алматы Қазақстанның ең үлкен қаласы")

# Uyghur (Arabic)
tokenizer.encode("بىز ئۇيغۇر تىلىدە سۆزلىشىمىز")

Specs

Property	Value
Type	SentencePiece BPE
Vocabulary size	65,536
Scripts	Latin, Cyrillic, Arabic
Languages trained on	20+ Turkic languages
Benchmark	FLORES-200 devtest

Limitations

English fertility (2.27) is higher than English-specialized tokenizers, as the vocabulary is optimized for Turkic languages.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support