---
license: mit
language:
- az
tags:
- tokenizer
- sentencepiece
- bpe
- azerbaijani
- low-resource
---

# AzText Tokenizer (SentencePiece BPE, 16k)

A SentencePiece BPE tokenizer trained on a 100,000-document sample of the
[AzText](https://huggingface.co/datasets/eljanmahammadli/AzText) curated
Azerbaijani corpus.

Released with the paper *AzText: Curating Web-Scale Pretraining Data for a
Low-Resource Language* (AIDT 2026).

## Specifications

- Algorithm: SentencePiece BPE
- Vocabulary size: 16,000
- Character coverage: 1.0
- Special tokens: `<unk>` (0), `<s>` (1), `</s>` (2)
- Wrapper class: `LlamaTokenizer` (compatible with `AutoTokenizer`)

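
The settings listed above map fairly directly onto SentencePiece trainer options. The following is a minimal sketch of how a comparable tokenizer could be trained; it is not the authors' training script. The helper names `build_trainer_args` and `train_tokenizer` and the file names used with them are hypothetical, and `pad_id=-1` (no padding token) is an assumption based on only three special tokens being listed.

```python
def build_trainer_args(corpus_path: str, model_prefix: str) -> dict:
    """Collect SentencePiece trainer options matching the listed specifications."""
    return {
        "input": corpus_path,          # plain-text file, one sentence or document per line
        "model_prefix": model_prefix,  # output files: <prefix>.model and <prefix>.vocab
        "model_type": "bpe",
        "vocab_size": 16000,
        "character_coverage": 1.0,
        "unk_id": 0,                   # <unk>
        "bos_id": 1,                   # <s>
        "eos_id": 2,                   # </s>
        "pad_id": -1,                  # assumption: padding token disabled
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    """Train a BPE model with the options above (requires the sentencepiece package)."""
    import sentencepiece as spm  # imported lazily so the options can be inspected without it

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))
```

Under these assumptions, calling `train_tokenizer("corpus.txt", "aztext_bpe")` would write `aztext_bpe.model` and `aztext_bpe.vocab` to the working directory.
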
## Compression

On a held-out 5,000-document evaluation set drawn from the curated corpus,
this tokenizer achieves approximately 0.24 tokens per character on
Azerbaijani text. By comparison, GPT-2's tokenizer requires roughly 2.7×
more tokens for the same input, and XLM-RoBERTa requires roughly 1.1× more.
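
How the tokens-per-character figure is aggregated is not stated here. The sketch below assumes a corpus-level average (total tokens divided by total characters across all documents) and shows how such a number, and the relative cost of a baseline tokenizer, could be computed. The function names `tokens_per_char` and `relative_cost` are hypothetical, and the two toy tokenizers (character-level and whitespace) only stand in for real ones.

```python
from typing import Callable, Iterable, Sequence


def tokens_per_char(encode: Callable[[str], Sequence], docs: Iterable[str]) -> float:
    """Corpus-level ratio: total tokens produced divided by total characters."""
    total_tokens = 0
    total_chars = 0
    for doc in docs:
        total_tokens += len(encode(doc))
        total_chars += len(doc)
    if total_chars == 0:
        raise ValueError("the evaluation set contains no characters")
    return total_tokens / total_chars


def relative_cost(
    baseline_encode: Callable[[str], Sequence],
    candidate_encode: Callable[[str], Sequence],
    docs: Iterable[str],
) -> float:
    """How many times more tokens the baseline emits than the candidate."""
    docs = list(docs)  # materialize so both tokenizers see the same documents
    return tokens_per_char(baseline_encode, docs) / tokens_per_char(candidate_encode, docs)


# Toy stand-ins: list() splits into characters, str.split() splits on whitespace.
sample_docs = ["Salam, dünya!", "Azərbaycan dilində bir nümunə."]
print(tokens_per_char(str.split, sample_docs))
print(relative_cost(list, str.split, sample_docs))
```

With a loaded Hugging Face tokenizer `tok`, the `encode` argument could be `lambda text: tok.encode(text, add_special_tokens=False)`; whether special tokens were counted in the reported figures is not specified.
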

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("eljanmahammadli/aztext-tokenizer")

# Encode a sample Azerbaijani sentence and inspect the resulting subword pieces.
ids = tok.encode("Salam, dünya! Azərbaycan dilində bir nümunə.")
print(tok.convert_ids_to_tokens(ids))
```

## Citation

```bibtex
@inproceedings{mahammadli2026aztext,
  title={AzText: Curating Web-Scale Pretraining Data for a Low-Resource Language},
  author={Mahammadli, Eljan and Rustamov, Samir},
  booktitle={Artificial Intelligence for Digital Transformations (AIDT)},
  year={2026}
}
```

## License

MIT.