---
license: mit
language:
- az
tags:
- tokenizer
- sentencepiece
- bpe
- azerbaijani
- low-resource
---

# AzText Tokenizer (SentencePiece BPE, 16k)

A SentencePiece BPE tokenizer trained on a 100,000-document sample of the [AzText](https://huggingface.co/datasets/eljanmahammadli/AzText) curated Azerbaijani corpus. Released with the paper *AzText: Curating Web-Scale Pretraining Data for a Low-Resource Language* (AIDT 2026).

## Specifications

- Algorithm: SentencePiece BPE
- Vocabulary size: 16,000
- Character coverage: 1.0
- Special tokens: `<unk>` (0), `<s>` (1), `</s>` (2)
- Wrapper class: `LlamaTokenizer` (compatible with `AutoTokenizer`)

## Compression

On a held-out 5,000-document evaluation set drawn from the curated corpus, this tokenizer achieves approximately 0.24 tokens per character on Azerbaijani text. By comparison, GPT-2's tokenizer requires roughly 2.7× more tokens for the same input, and XLM-RoBERTa requires roughly 1.1× more.

## Usage

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("eljanmahammadli/aztext-tokenizer")
ids = tok.encode("Salam, dünya! Azərbaycan dilində bir nümunə.")
print(tok.convert_ids_to_tokens(ids))
```

## Citation

```bibtex
@inproceedings{mahammadli2026aztext,
  title={AzText: Curating Web-Scale Pretraining Data for a Low-Resource Language},
  author={Mahammadli, Eljan and Rustamov, Samir},
  booktitle={Artificial Intelligence for Digital Transformations (AIDT)},
  year={2026}
}
```

## License

MIT.
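
## Measuring compression (sketch)

The tokens-per-character figure reported under Compression can be approximated on your own text with a sketch like the one below. It is not the evaluation script used for the paper: the helper names `tokens_per_char` and `measure_with_hub_tokenizer` are hypothetical, and the choices to count Unicode code points and to exclude special tokens are assumptions, so results may differ slightly from the reported 0.24.

```python
from typing import Callable, Iterable, List


def tokens_per_char(encode: Callable[[str], List], texts: Iterable[str]) -> float:
    """Return total tokens divided by total characters over `texts`.

    `encode` is any callable that maps a string to a list of tokens or ids.
    """
    n_tokens = 0
    n_chars = 0
    for text in texts:
        n_tokens += len(encode(text))
        n_chars += len(text)  # Unicode code points, not bytes (an assumption)
    if n_chars == 0:
        raise ValueError("texts contain no characters")
    return n_tokens / n_chars


def measure_with_hub_tokenizer(texts: List[str]) -> float:
    """Apply tokens_per_char using the tokenizer from the Hugging Face Hub."""
    # Imported lazily so the helper above stays usable without transformers.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("eljanmahammadli/aztext-tokenizer")
    # Assumption: special tokens such as <s> are left out of the count.
    return tokens_per_char(
        lambda t: tok.encode(t, add_special_tokens=False), texts
    )
```

Calling `measure_with_hub_tokenizer(["Salam, dünya!"])` downloads the tokenizer and returns a single ratio. Passing a different tokenizer's `encode` to `tokens_per_char` over the same texts gives a comparable number for computing relative token counts.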