Token-Efficient SentencePiece Tokenizers for Turkic Languages
This repository provides 24 SentencePiece BPE tokenizers (20 per-language and 4 shared) for Turkish, Azerbaijani, Kazakh, Uzbek, and Kyrgyz, released alongside the paper "Token-Efficient SentencePiece Tokenizers for Turkic Languages" (Mahammadli and Rustamov, 2026).
- GitHub (code + training pipeline + evaluation): https://github.com/eljanmahammadli/turkic-tokenizers
- Paper: under review at IEEE AICT 2026
What is included
| Family | Languages | Vocab sizes | Files |
|---|---|---|---|
| tokenizers/lang/{tr,az,kk,uz,ky}/ | each language separately | 8K, 16K, 32K, 64K | spm_{V}.model + spm_{V}.vocab |
| tokenizers/shared/ | all 5 languages, uniform-sampled union | 8K, 16K, 32K, 64K | spm_{V}.model + spm_{V}.vocab |
All tokenizers are BPE with character coverage 1.0 and SentencePiece's default nmt_nfkc normalization. Special tokens: <unk>, <s>, </s>. Downstream language modeling results in the paper use V = 32K.
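The settings above map onto standard SentencePiece trainer options. The following is a minimal sketch of how such a tokenizer could be trained, not the authors' actual pipeline (see the GitHub repository for that). The corpus path and the helper names `spm_training_args` and `train_tokenizer` are hypothetical; the special-token ids are SentencePiece's defaults written out explicitly.

```python
def spm_training_args(corpus_path: str, model_prefix: str, vocab_size: int) -> dict:
    """Build SentencePiece trainer options matching the settings stated above."""
    return {
        "input": corpus_path,          # hypothetical path, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "model_type": "bpe",
        "vocab_size": vocab_size,
        "character_coverage": 1.0,
        "normalization_rule_name": "nmt_nfkc",
        # SentencePiece defaults, spelled out: <unk>, <s>, </s>, and no <pad>.
        "unk_id": 0,
        "bos_id": 1,
        "eos_id": 2,
        "pad_id": -1,
    }


def train_tokenizer(corpus_path: str, model_prefix: str, vocab_size: int) -> None:
    """Train one tokenizer; needs the sentencepiece package and a real corpus."""
    # Imported lazily so the options can be inspected without the dependency.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        **spm_training_args(corpus_path, model_prefix, vocab_size)
    )


if __name__ == "__main__":
    # Print the options only; call train_tokenizer(...) once a corpus exists.
    for key, value in spm_training_args("corpus.az.txt", "spm_32000", 32000).items():
        print(f"{key}={value}")
```

Keeping the options in a plain dictionary makes it easy to loop over the four vocabulary sizes while holding every other setting fixed.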
Quickstart
from huggingface_hub import hf_hub_download
import sentencepiece as spm
# Per-language tokenizer (recommended for single-language workflows)
path = hf_hub_download(
    "eljanmahammadli/turkic-tokenizers",
    "tokenizers/lang/az/spm_32000.model",
)
sp = spm.SentencePieceProcessor(model_file=path)
print(sp.encode("Salam, dünya!", out_type=str))
# ['▁Salam', ',', '▁dünya', '!']
# Shared tokenizer (single artifact covering all 5 languages)
path = hf_hub_download(
    "eljanmahammadli/turkic-tokenizers",
    "tokenizers/shared/spm_32000.model",
)
sp = spm.SentencePieceProcessor(model_file=path)
To get all 24 tokenizers at once, clone the whole repo with git clone https://huggingface.co/eljanmahammadli/turkic-tokenizers or use snapshot_download from huggingface_hub.
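As a rough sketch of the snapshot_download route, the snippet below lists the 24 expected model files and fetches only the .model files in one call. The helper names `model_files` and `download_all` are hypothetical, and the file names for 8K, 16K, and 64K are an assumption extrapolated from spm_32000.model, the only name confirmed in the Quickstart.

```python
from typing import List, Optional

REPO_ID = "eljanmahammadli/turkic-tokenizers"
LANGS = ["tr", "az", "kk", "uz", "ky"]
# Assumed spellings of 8K..64K; only 32000 is confirmed by the Quickstart.
VOCAB_SIZES = [8000, 16000, 32000, 64000]


def model_files() -> List[str]:
    """Repo-relative paths of all per-language and shared .model files."""
    per_language = [
        f"tokenizers/lang/{lang}/spm_{size}.model"
        for lang in LANGS
        for size in VOCAB_SIZES
    ]
    shared = [f"tokenizers/shared/spm_{size}.model" for size in VOCAB_SIZES]
    return per_language + shared


def download_all(local_dir: Optional[str] = None) -> str:
    """Download every .model file and return the local snapshot folder."""
    # Imported lazily so model_files() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(REPO_ID, allow_patterns=["*.model"], local_dir=local_dir)


if __name__ == "__main__":
    print(len(model_files()))  # 5 languages x 4 sizes + 4 shared
```

After `download_all()` returns, each path from `model_files()` can be joined onto the returned folder and passed to `spm.SentencePieceProcessor(model_file=...)`.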
Headline results from the paper
Under a matched token-position budget on a small decoder-only Transformer:
- Cost-fairness (FLORES-200): production tokenizers encode Turkic text at 1.43–5.72× the English token count. The per-language Turkic tokenizers invert this to 0.54–0.64× and the shared tokenizer to 0.66–0.74×.
- Held-out FLORES BPC: the per-language 32K tokenizer is lowest on every language; nearly all paired comparisons against GPT-2, GPT-4o, and XLM-R survive Bonferroni correction at n=5 seeds.
- Downstream (SIB-200 classification, WikiANN NER): the Turkic tokenizers are stronger than GPT-2/GPT-4o on every cell and competitive with XLM-R across most cells.
See the paper for full tables, error bars, and the character-matched control.
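For readers unfamiliar with the metrics above, the sketch below shows one common way to compute a token-count ratio on parallel text, bits per character (BPC), and a Bonferroni-adjusted significance threshold. These are textbook definitions under stated assumptions: the paper's exact protocol (which tokenizer encodes the English side, how characters are counted, how many comparisons enter the correction) may differ, and all function names are hypothetical.

```python
import math
from typing import Callable, Sequence


def token_ratio(
    encode_src: Callable[[str], Sequence],
    src_sentences: Sequence[str],
    encode_en: Callable[[str], Sequence],
    en_sentences: Sequence[str],
) -> float:
    """Total tokens on the source side divided by total tokens on the English side."""
    src_tokens = sum(len(encode_src(s)) for s in src_sentences)
    en_tokens = sum(len(encode_en(s)) for s in en_sentences)
    return src_tokens / en_tokens


def bits_per_character(token_nll_nats: Sequence[float], text: str) -> float:
    """Summed token negative log-likelihood (in nats) converted to bits per character.

    Characters are counted as Unicode code points here, which is an assumption.
    """
    return sum(token_nll_nats) / (math.log(2) * len(text))


def bonferroni_alpha(alpha: float, num_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / num_comparisons


if __name__ == "__main__":
    # Whitespace splitting stands in for a real tokenizer's encode method.
    ratio = token_ratio(str.split, ["Salam dünya necəsən"], str.split, ["Hello world"])
    print(ratio)
```

Because BPC normalizes by characters rather than tokens, it allows models with different tokenizers to be compared on the same held-out text.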
Training data
Tokenizers were trained on a frozen subset of FineWeb-2 (HF snapshot, 2026-04), using a training split of 100M raw tokens for each Turkic language. The FLORES-200 devtest set is held out and was audited for contamination: 0 of 5,060 sentences appear in the training data.
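A contamination audit of this kind can be approximated with an exact-match check. The sketch below is an illustration only, since the authors' actual audit procedure is not described here: it assumes NFKC normalization plus whitespace collapsing, and it only detects a held-out sentence when it makes up an entire training line. Sentences embedded inside longer lines would need a substring search instead. The function names are hypothetical.

```python
import unicodedata
from typing import Iterable, List


def normalize(sentence: str) -> str:
    """Apply NFKC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", sentence).split())


def contaminated(heldout: Iterable[str], training_lines: Iterable[str]) -> List[str]:
    """Return the normalized held-out sentences that also occur as training lines."""
    targets = {normalize(s) for s in heldout}
    hits = set()
    # Stream the training data so it never has to fit in memory at once.
    for line in training_lines:
        key = normalize(line)
        if key in targets:
            hits.add(key)
    return sorted(hits)


if __name__ == "__main__":
    heldout = ["Salam,  dünya!", "Bu bir testdir."]
    training = ["Salam, dünya!", "başqa cümlə"]
    print(contaminated(heldout, training))
```

An empty result corresponds to the "0 of 5,060" outcome reported above, under the exact-match assumption.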
Citation
@inproceedings{mahammadli2026turkic,
author = {Eljan Mahammadli and Samir Rustamov},
title = {Token-Efficient {SentencePiece} Tokenizers for {T}urkic Languages},
booktitle = {Proceedings of the 20th IEEE International Conference on Application of Information and Communication Technologies (AICT)},
address = {Baku, Azerbaijan},
year = {2026},
note = {Under review}
}
License
Apache License 2.0. See LICENSE.
Authors
- Eljan Mahammadli, Khazar University, Baku (elcan.mahammadli@khazar.org)
- Samir Rustamov, ADA University, Baku (srustamov@ada.edu.az)