---
license: mit
language:
- az
tags:
- tokenizer
- sentencepiece
- bpe
- azerbaijani
- low-resource
---

# AzText Tokenizer (SentencePiece BPE, 16k)

A SentencePiece BPE tokenizer trained on a 100,000-document sample of the
[AzText](https://huggingface.co/datasets/eljanmahammadli/AzText) curated
Azerbaijani corpus.

Released with the paper *AzText: Curating Web-Scale Pretraining Data for a
Low-Resource Language* (AIDT 2026).

## Specifications

- Algorithm: SentencePiece BPE
- Vocabulary size: 16,000
- Character coverage: 1.0
- Special tokens: `<unk>` (0), `<s>` (1), `</s>` (2)
- Wrapper class: `LlamaTokenizer` (compatible with `AutoTokenizer`)
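
This card does not include the training command, so the following is only a sketch of how the specifications above could map onto SentencePiece trainer options. The helper names `spm_training_args` and `train_tokenizer`, the corpus path, and the model prefix are hypothetical; any option not listed in the specifications is left at the library default, which may differ from what was actually used.

```python
def spm_training_args(corpus_path: str, model_prefix: str) -> dict:
    """Translate the listed specifications into SentencePiece trainer options.

    Assumption: only the settings stated in this card are set explicitly;
    everything else falls back to SentencePiece defaults.
    """
    return {
        "input": corpus_path,          # hypothetical: one document or sentence per line
        "model_prefix": model_prefix,  # hypothetical output name
        "model_type": "bpe",
        "vocab_size": 16000,
        "character_coverage": 1.0,
        "unk_id": 0,                   # <unk>
        "bos_id": 1,                   # <s>
        "eos_id": 2,                   # </s>
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    """Train a model; writes <model_prefix>.model and <model_prefix>.vocab."""
    # Imported lazily so the argument helper can be inspected without the package.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**spm_training_args(corpus_path, model_prefix))
```

Calling `train_tokenizer("sample.txt", "aztext_bpe16k")` with a local text file would produce a model file that `LlamaTokenizer` can load, under the assumptions above.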

## Compression

On a held-out 5,000-document evaluation set drawn from the curated corpus,
this tokenizer achieves approximately 0.24 tokens per character on
Azerbaijani text (lower means better compression). By comparison, GPT-2's
tokenizer requires roughly 2.7× more tokens for the same input, and
XLM-RoBERTa's tokenizer requires roughly 1.1× more.
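
As a rough illustration of the metric, tokens per character can be computed as total tokens divided by total characters over an evaluation set. The sketch below assumes corpus-level aggregation and counts characters as Unicode code points; the card does not state how the reported figures were aggregated or whether special tokens were counted, and the function names are hypothetical.

```python
from typing import Callable, Iterable, Sequence


def tokens_per_char(docs: Iterable[str], tokenize: Callable[[str], Sequence]) -> float:
    """Total token count divided by total character count over all documents."""
    total_tokens = 0
    total_chars = 0
    for doc in docs:
        total_tokens += len(tokenize(doc))
        total_chars += len(doc)  # Python str length = number of code points
    if total_chars == 0:
        raise ValueError("evaluation set contains no characters")
    return total_tokens / total_chars


def relative_cost(other_tpc: float, reference_tpc: float) -> float:
    """How many times as many tokens another tokenizer needs for the same text."""
    return other_tpc / reference_tpc
```

With a loaded Hugging Face tokenizer `tok`, one possible call is `tokens_per_char(docs, lambda text: tok.encode(text, add_special_tokens=False))`, where `docs` is any iterable of held-out strings.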

## Usage

```python
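from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("eljanmahammadli/aztext-tokenizer")

# Encode an Azerbaijani sentence to token IDs, then print the subword pieces.
ids = tok.encode("Salam, dünya! Azərbaycan dilində bir nümunə.")
print(tok.convert_ids_to_tokens(ids))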
```

## Citation

```bibtex
@inproceedings{mahammadli2026aztext,
  title={AzText: Curating Web-Scale Pretraining Data for a Low-Resource Language},
  author={Mahammadli, Eljan and Rustamov, Samir},
  booktitle={Artificial Intelligence for Digital Transformations (AIDT)},
  year={2026}
}
```

## License

MIT.