Pan-Turkic BPE Tokenizer

A SentencePiece BPE tokenizer with 65,536 vocabulary size, purpose-built for the Turkic language family. Covers Latin, Cyrillic, and Arabic scripts used across Turkic languages.

Overview

Most existing tokenizers fail on Turkic languages outside of Turkish — particularly on Cyrillic-script languages like Kazakh, Kyrgyz, Bashkir, and Tatar, where they fall back to byte-level tokenization. This tokenizer was trained specifically on a pan-Turkic corpus covering 20+ languages, and handles all major scripts natively.

Languages with strong coverage:

Language Script
Turkish Latin
Kazakh Cyrillic
Kyrgyz Cyrillic
Uzbek Latin
Uyghur Arabic
Bashkir Cyrillic
Tatar Cyrillic
Azerbaijani Latin
Crimean Tatar Latin
Turkmen Latin

FLORES-200 Fertility Benchmark

Fertility = average tokens per word (lower is better). Evaluated on 1,012 sentences per language from the FLORES-200 devtest set.

Language Ours Kumru-2B GPT-2 mT5 NLLB-200 XLM-R
Turkish 1.78 1.59 3.79 2.16 2.00 1.83
Kazakh (Cyrl) 1.79 10.96 9.25 2.35 2.09 2.04
Kyrgyz (Cyrl) 1.73 11.18 8.95 2.57 2.21 2.07
Uzbek (Latn) 1.96 3.40 3.44 2.57 2.24 2.26
Uyghur (Arab) 1.72 9.42 10.92 4.91 2.45 2.46
Bashkir (Cyrl) 1.92 10.93 9.07 3.01 2.10 3.52
Tatar (Cyrl) 1.88 10.74 8.72 2.63 2.06 3.07
Azerbaijani (Latn) 1.72 3.34 4.92 2.40 2.16 1.86
Crimean Tatar (Latn) 2.19 2.61 3.75 2.49 2.14 2.36
Turkmen (Latn) 2.48 3.56 4.27 3.18 2.33 3.05
Turkic Avg (10 langs) 1.92 6.77 6.71 2.83 2.18 2.45
English 2.27 2.01 1.24 1.55 1.41 1.41

Vocab sizes: Ours 65,536 · Kumru-2B 50,176 · GPT-2 50,257 · mT5 250,100 · NLLB-200 256,204 · XLM-R 250,002

Key result: Best on 7 of 10 Turkic languages. Achieves similar Turkic coverage to NLLB-200 (256K vocab) with a 4× smaller vocabulary.

For Cyrillic-script Turkic languages (Kazakh, Kyrgyz, Bashkir, Tatar), competing tokenizers degrade to byte-level encoding (10–11 tokens/word). This tokenizer maintains ~1.8 tokens/word on the same languages.

Notable Examples

Morphologically complex Turkish words encode efficiently:

"Cumhurbaşkanlığı"1 token   # (Presidency)
"yapamayacaklarından"3 tokens  # (from those they cannot do)
"sağlıklaştırılamayabileceklerden"6 tokens

Perfect round-trip for all supported scripts:

encode → decode  # lossless for Latin, Cyrillic, and Arabic Turkic scripts

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ArinUmut/pan-turkic-tokenizer")

# Turkish
tokenizer.encode("Türkiye Cumhuriyeti")

# Kazakh (Cyrillic)
tokenizer.encode("Алматы Қазақстанның ең үлкен қаласы")

# Uyghur (Arabic)
tokenizer.encode("بىز ئۇيغۇر تىلىدە سۆزلىشىمىز")

Specs

Property Value
Type SentencePiece BPE
Vocabulary size 65,536
Scripts Latin, Cyrillic, Arabic
Languages trained on 20+ Turkic languages
Benchmark FLORES-200 devtest

Limitations

  • English fertility (2.27) is higher than English-specialized tokenizers, as the vocabulary is optimized for Turkic languages.
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support