Token-Efficient SentencePiece Tokenizers for Turkic Languages

This repository contains 24 SentencePiece BPE tokenizers (20 per-language and 4 shared) for Turkish, Azerbaijani, Kazakh, Uzbek, and Kyrgyz, released alongside the paper "Token-Efficient SentencePiece Tokenizers for Turkic Languages" (Mahammadli and Rustamov, 2026).

What is included

| Family | Languages | Vocab sizes | Files |
|---|---|---|---|
| tokenizers/lang/{tr,az,kk,uz,ky}/ | each language separately | 8K, 16K, 32K, 64K | spm_{V}.model + spm_{V}.vocab |
| tokenizers/shared/ | all 5 languages, uniform-sampled union | 8K, 16K, 32K, 64K | spm_{V}.model + spm_{V}.vocab |

All tokenizers are BPE with character coverage 1.0 and SentencePiece's default nmt_nfkc normalization. Special tokens: <unk>, <s>, </s>. Downstream language modeling results in the paper use V = 32K.
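For readers who want to train a comparable tokenizer on their own data, the settings above map onto SentencePiece trainer options roughly as follows. This is a minimal sketch, not the authors' training script: the corpus path, the model prefix, and the helper names `bpe_training_kwargs` and `train_tokenizer` are hypothetical. The three special tokens need no extra options, since `<unk>`, `<s>`, and `</s>` are SentencePiece's defaults.

```python
def bpe_training_kwargs(corpus_path, model_prefix, vocab_size=32000):
    """Build SentencePieceTrainer.train keyword arguments for the stated settings."""
    return {
        "input": corpus_path,            # hypothetical plain-text corpus, one sentence per line
        "model_prefix": model_prefix,    # trainer writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": "bpe",
        "character_coverage": 1.0,
        "normalization_rule_name": "nmt_nfkc",
    }


def train_tokenizer(corpus_path, model_prefix, vocab_size=32000):
    """Train a BPE model; sentencepiece is imported lazily so the helper above stays dependency-free."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        **bpe_training_kwargs(corpus_path, model_prefix, vocab_size)
    )
```

Calling `train_tokenizer("corpus.txt", "spm_32000")` would then produce `spm_32000.model` and `spm_32000.vocab` in the working directory.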

Quickstart

from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Per-language tokenizer (recommended for single-language workflows)
path = hf_hub_download(
    "eljanmahammadli/turkic-tokenizers",
    "tokenizers/lang/az/spm_32000.model",
)
sp = spm.SentencePieceProcessor(model_file=path)
print(sp.encode("Salam, dünya!", out_type=str))
# ['▁Salam', ',', '▁dünya', '!']

# Shared tokenizer (single artifact covering all 5 languages)
path = hf_hub_download(
    "eljanmahammadli/turkic-tokenizers",
    "tokenizers/shared/spm_32000.model",
)
sp = spm.SentencePieceProcessor(model_file=path)

To get all 24 at once, clone the whole repo with git clone https://huggingface.co/eljanmahammadli/turkic-tokenizers or use snapshot_download.
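The `snapshot_download` route can be sketched as below. The helper names `model_filename` and `download_all` are hypothetical. Only the `spm_32000` file name is confirmed by the Quickstart; names for the other vocabulary sizes are assumed to follow the same `spm_{V}` pattern, so check the repository listing before relying on them.

```python
import os

LANGS = {"tr", "az", "kk", "uz", "ky"}


def model_filename(vocab_size, family="shared", lang=None):
    """Relative path of a .model file, following the layout under 'What is included'."""
    if family == "shared":
        return f"tokenizers/shared/spm_{vocab_size}.model"
    if family == "lang":
        if lang not in LANGS:
            raise ValueError(f"unknown language code: {lang!r}")
        return f"tokenizers/lang/{lang}/spm_{vocab_size}.model"
    raise ValueError(f"unknown family: {family!r}")


def download_all(repo_id="eljanmahammadli/turkic-tokenizers"):
    """Download every .model and .vocab file and return the local snapshot directory."""
    from huggingface_hub import snapshot_download  # lazy import; requires network access

    return snapshot_download(repo_id, allow_patterns=["*.model", "*.vocab"])


def local_model_path(snapshot_dir, vocab_size, family="shared", lang=None):
    """Join the snapshot directory with the relative model path."""
    return os.path.join(snapshot_dir, *model_filename(vocab_size, family, lang).split("/"))
```

With the snapshot in place, `local_model_path(download_all(), 32000, "lang", "az")` gives a path that can be passed to `spm.SentencePieceProcessor(model_file=...)`.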

Headline results from the paper

Under a matched token-position budget on a small decoder-only Transformer:

  • Cost-fairness (FLORES-200): production tokenizers encode Turkic text at 1.43–5.72× the English token count. The per-language Turkic tokenizers invert this to 0.54–0.64× and the shared tokenizer to 0.66–0.74×.
  • Held-out FLORES BPC: the per-language 32K tokenizer is lowest on every language; nearly all paired comparisons against GPT-2, GPT-4o, and XLM-R survive Bonferroni correction at n=5 seeds.
  • Downstream (SIB-200 classification, WikiANN NER): the Turkic tokenizers are stronger than GPT-2/GPT-4o in every cell and competitive with XLM-R in most cells.

See the paper for full tables, error bars, and the character-matched control.
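The two headline metrics can be illustrated with a short sketch. It uses the standard definitions and is not the paper's evaluation code: the function names are hypothetical, the choice of English baseline tokenizer is left to the caller because this README does not specify it, and characters are counted as Unicode code points by assumption.

```python
import math


def token_ratio(encode_target, target_sentences, encode_english, english_sentences):
    """Total tokens for the target-language text divided by total tokens for the parallel English text."""
    target_tokens = sum(len(encode_target(s)) for s in target_sentences)
    english_tokens = sum(len(encode_english(s)) for s in english_sentences)
    return target_tokens / english_tokens


def bits_per_character(token_nll_nats, text):
    """Convert per-token negative log-likelihoods (in nats) into bits per character of the text."""
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / len(text)
```

Here `encode_target` and `encode_english` are any callables that map a string to a token list, for example `sp.encode` from the Quickstart. A ratio below 1.0 means the target-language text costs fewer tokens than its English counterpart.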

Training data

Tokenizers were trained on a frozen subset of FineWeb-2 (HF snapshot, 2026-04), using a training split of 100M raw tokens for each Turkic language. FLORES-200 devtest is held out and was audited for contamination: 0 of 5,060 sentences appear in the training data.
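A contamination audit of this kind can be approximated with a normalized exact-match check. The sketch below is an assumption about how such a check might look, not the procedure used in the paper: the helper names are hypothetical, and matching whole training lines after NFKC and whitespace normalization is one of several reasonable criteria.

```python
import unicodedata


def normalize(sentence):
    """Apply NFKC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", sentence).split())


def contaminated_sentences(eval_sentences, training_lines):
    """Return the normalized evaluation sentences that also occur as a training line.

    The small evaluation set is held in memory; training lines are streamed,
    so training_lines can be a file object or a generator.
    """
    targets = {normalize(s) for s in eval_sentences}
    found = set()
    for line in training_lines:
        key = normalize(line)
        if key in targets:
            found.add(key)
    return found
```

An empty result for every language corresponds to the "0 of 5,060" figure under this exact-match criterion; a substring or n-gram overlap test would be stricter.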

Citation

@inproceedings{mahammadli2026turkic,
  author    = {Eljan Mahammadli and Samir Rustamov},
  title     = {Token-Efficient {SentencePiece} Tokenizers for {T}urkic Languages},
  booktitle = {Proceedings of the 20th IEEE International Conference on Application of Information and Communication Technologies (AICT)},
  address   = {Baku, Azerbaijan},
  year      = {2026},
  note      = {Under review}
}

License

Apache License 2.0. See LICENSE.

Authors

  • Eljan Mahammadli — Khazar University, Baku — elcan.mahammadli@khazar.org
  • Samir Rustamov — ADA University, Baku — srustamov@ada.edu.az