---
license: mit
language:
- az
tags:
- tokenizer
- sentencepiece
- bpe
- azerbaijani
- low-resource
---
# AzText Tokenizer (SentencePiece BPE, 16k)
A SentencePiece BPE tokenizer trained on a 100,000-document sample of the
[AzText](https://huggingface.co/datasets/eljanmahammadli/AzText) curated
Azerbaijani corpus.
Released with the paper *AzText: Curating Web-Scale Pretraining Data for a
Low-Resource Language* (AIDT 2026).
## Specifications
- Algorithm: SentencePiece BPE
- Vocabulary size: 16,000
- Character coverage: 1.0
- Special tokens: `<unk>` (0), `<s>` (1), `</s>` (2)
- Wrapper class: `LlamaTokenizer` (compatible with `AutoTokenizer`)
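
The settings above map fairly directly onto SentencePiece trainer options. The sketch below shows one plausible way to express them; it is not the authors' training script. The corpus path, the model prefix, and the choice to disable the padding token (`pad_id=-1`) are assumptions made for illustration, and the helper names `spm_training_args` and `train_tokenizer` are hypothetical.

```python
from typing import Any, Dict


def spm_training_args(corpus_path: str, model_prefix: str) -> Dict[str, Any]:
    """Build SentencePiece trainer options matching the listed specifications.

    corpus_path and model_prefix are placeholders, not paths from the release.
    """
    return {
        "input": corpus_path,          # one document or sentence per line (assumed)
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "model_type": "bpe",
        "vocab_size": 16000,
        "character_coverage": 1.0,
        "unk_id": 0,                   # <unk>
        "bos_id": 1,                   # <s>
        "eos_id": 2,                   # </s>
        "pad_id": -1,                  # assumption: no dedicated padding token
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    """Train a model with the options above (requires the sentencepiece package)."""
    import sentencepiece as spm  # imported lazily so the helper above has no dependency

    spm.SentencePieceTrainer.train(**spm_training_args(corpus_path, model_prefix))


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the call.
    print(spm_training_args("aztext_sample.txt", "aztext_bpe16k"))
```

Any option not listed in the specifications (normalization rules, sampling of input sentences, and so on) is left at the library default here, which may differ from what was used for the released model.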
## Compression
On a held-out 5,000-document evaluation set drawn from the curated corpus,
this tokenizer achieves approximately 0.24 tokens per character on
Azerbaijani text. By comparison, GPT-2's tokenizer requires roughly 2.7×
more tokens for the same input, and XLM-RoBERTa requires roughly 1.1× more.
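
For readers who want to reproduce a similar measurement on their own text, the following sketch computes tokens per character and the relative token cost of one tokenizer against another. It assumes that "characters" means Unicode code points (Python's `len` on a string) and that the comparison figures are ratios of token counts on the same text; the model card does not state either detail, so treat both as assumptions. The function names and the toy encoders are hypothetical.

```python
from typing import Callable, Iterable, List

Encoder = Callable[[str], List]


def tokens_per_char(encode: Encoder, texts: Iterable[str]) -> float:
    """Total tokens divided by total characters over a collection of texts."""
    total_tokens = 0
    total_chars = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_chars += len(text)  # counts Unicode code points, not bytes
    if total_chars == 0:
        raise ValueError("texts contain no characters")
    return total_tokens / total_chars


def relative_cost(baseline: Encoder, candidate: Encoder, texts: Iterable[str]) -> float:
    """How many times more tokens `baseline` needs than `candidate` on the same texts."""
    texts = list(texts)  # materialize so both encoders see identical input
    return tokens_per_char(baseline, texts) / tokens_per_char(candidate, texts)


if __name__ == "__main__":
    # Toy encoders standing in for real tokenizers.
    def char_encode(text: str) -> List[str]:
        return list(text)

    def word_encode(text: str) -> List[str]:
        return text.split()

    sample = ["Salam dünya", "bir nümunə"]
    print(tokens_per_char(word_encode, sample))
    print(relative_cost(char_encode, word_encode, sample))
```

With a tokenizer loaded through `transformers`, an encoder such as `lambda t: tok.encode(t, add_special_tokens=False)` can be passed in place of the toy functions, so that beginning-of-sequence tokens do not inflate the count on short texts.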
## Usage
```python
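from transformers import AutoTokenizer

# Load the tokenizer files from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("eljanmahammadli/aztext-tokenizer")

# Encode an Azerbaijani sentence to token ids, then print the subword pieces.
ids = tok.encode("Salam, dünya! Azərbaycan dilində bir nümunə.")
print(tok.convert_ids_to_tokens(ids))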
```
## Citation
```bibtex
@inproceedings{mahammadli2026aztext,
title={AzText: Curating Web-Scale Pretraining Data for a Low-Resource Language},
author={Mahammadli, Eljan and Rustamov, Samir},
booktitle={Artificial Intelligence for Digital Transformations (AIDT)},
year={2026}
}
```
## License
MIT.