	---
	license: mit
	language:
	- az
	tags:
	- tokenizer
	- sentencepiece
	- bpe
	- azerbaijani
	- low-resource
	---

	# AzText Tokenizer (SentencePiece BPE, 16k)

	A SentencePiece BPE tokenizer trained on a 100,000-document sample of the
	[AzText](https://huggingface.co/datasets/eljanmahammadli/AzText) curated
	Azerbaijani corpus.

	Released with the paper *AzText: Curating Web-Scale Pretraining Data for a
	Low-Resource Language* (AIDT 2026).

	## Specifications

	- Algorithm: SentencePiece BPE
	- Vocabulary size: 16,000
	- Character coverage: 1.0
	- Special tokens: `<unk>` (0), `<s>` (1), `</s>` (2)
	- Wrapper class: `LlamaTokenizer` (compatible with `AutoTokenizer`)
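
For readers who want to train a comparable tokenizer, the specifications above map onto SentencePiece trainer options roughly as follows. This is a minimal sketch, not the authors' training script: `corpus_path` and `model_prefix` are hypothetical placeholders, and any option not listed above (for example disabling the padding token) is an assumption.

```python
def build_trainer_args(corpus_path: str, model_prefix: str) -> dict:
    """Collect SentencePiece trainer options that mirror the listed specifications."""
    return {
        "input": corpus_path,          # hypothetical: plain-text file, one document per line
        "model_prefix": model_prefix,  # hypothetical: prefix for the .model/.vocab outputs
        "model_type": "bpe",           # Algorithm: SentencePiece BPE
        "vocab_size": 16000,           # Vocabulary size: 16,000
        "character_coverage": 1.0,     # Character coverage: 1.0
        "unk_id": 0,                   # <unk>
        "bos_id": 1,                   # <s>
        "eos_id": 2,                   # </s>
        "pad_id": -1,                  # assumption: no padding token, since none is listed
    }


def train(corpus_path: str, model_prefix: str) -> None:
    """Train a model with the options above (requires `pip install sentencepiece`)."""
    import sentencepiece as spm  # imported lazily so the helper above has no dependencies

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))
```

Calling `train("corpus.txt", "aztext_bpe")` would write `aztext_bpe.model` and `aztext_bpe.vocab`; both names are illustrative only.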

	## Compression

	On a held-out 5,000-document evaluation set drawn from the curated corpus,
	this tokenizer achieves approximately 0.24 tokens per character on
	Azerbaijani text. By comparison, GPT-2's tokenizer requires roughly 2.7×
	more tokens for the same input, and XLM-RoBERTa requires roughly 1.1× more.
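
A tokens-per-character figure like the one above can be computed along these lines. The sketch assumes the metric is pooled over the whole evaluation set (total tokens divided by total characters), that characters are counted as Unicode code points, and that special tokens are excluded. The card does not state these details, so results may differ slightly from the reported numbers.

```python
from typing import Callable, Iterable, Sequence


def tokens_per_char(encode: Callable[[str], Sequence], docs: Iterable[str]) -> float:
    """Return total tokens divided by total characters over all documents."""
    total_tokens = 0
    total_chars = 0
    for doc in docs:
        total_tokens += len(encode(doc))  # number of tokens produced for this document
        total_chars += len(doc)           # Python str length counts Unicode code points
    if total_chars == 0:
        raise ValueError("no characters to evaluate")
    return total_tokens / total_chars


def relative_cost(baseline_tpc: float, reference_tpc: float) -> float:
    """Return how many times as many tokens the baseline needs for the same text."""
    return baseline_tpc / reference_tpc
```

With a `transformers` tokenizer `tok`, the `encode` argument could be `lambda s: tok.encode(s, add_special_tokens=False)`, applied to the same documents for each tokenizer being compared.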

	## Usage

	```python
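	from transformers import AutoTokenizer

	# Download the tokenizer files from the Hugging Face Hub.
	tok = AutoTokenizer.from_pretrained("eljanmahammadli/aztext-tokenizer")
	# Encode a sample Azerbaijani sentence into token ids.
	ids = tok.encode("Salam, dünya! Azərbaycan dilində bir nümunə.")
	# Map the ids back to subword pieces to inspect the segmentation.
	print(tok.convert_ids_to_tokens(ids))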
	```

	## Citation

	```bibtex
	@inproceedings{mahammadli2026aztext,
	title={AzText: Curating Web-Scale Pretraining Data for a Low-Resource Language},
	author={Mahammadli, Eljan and Rustamov, Samir},
	booktitle={Artificial Intelligence for Digital Transformations (AIDT)},
	year={2026}
	}
	```

	## License

	MIT.