# Pashto BPE Tokenizer (32k)

A Pashto-native BPE tokenizer trained on 24 million unique sentences from the PashtoCorp corpus (1.25 billion words). Provided in two formats: a HuggingFace-native tokenizer (load with `AutoTokenizer`) and a SentencePiece model for broader compatibility.
## Performance
| Metric | Value |
|---|---|
| Vocabulary size | 32,000 |
| Pashto BPE fertility | 1.2516 tokens/word |
| XLM-R fertility (Pashto) | 1.5018 tokens/word |
| Token reduction vs XLM-R | 16.67% fewer tokens |
| Training sentences | 24,035,371 |
Fertility is the average number of tokens per whitespace-split word; lower means more efficient encoding. At 1.2516 tokens/word, this tokenizer uses 16.67% fewer tokens than XLM-R's multilingual tokenizer on Pashto text, which directly reduces compute cost for any transformer model trained with it.
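The reduction figure in the table follows directly from the two fertility values; a quick check of the arithmetic:

```python
pashto_fertility = 1.2516  # tokens/word for this tokenizer
xlmr_fertility = 1.5018    # tokens/word for XLM-R on the same Pashto text

# Relative reduction in token count when switching tokenizers
reduction = (xlmr_fertility - pashto_fertility) / xlmr_fertility
print(f"{reduction:.1%}")  # 16.7%
```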
## Files
| File | Format | Use case |
|---|---|---|
| `tokenizer.json` | HuggingFace tokenizers | `AutoTokenizer.from_pretrained()` |
| `tokenizer_config.json` | HuggingFace config | Required for `AutoTokenizer` |
| `special_tokens_map.json` | HuggingFace config | Required for `AutoTokenizer` |
| `sentencepiece/pashto_bpe_32k.model` | SentencePiece | MT, ASR, other pipelines |
| `sentencepiece/pashto_bpe_32k.vocab` | SentencePiece vocab | Human-readable vocabulary |
## Training details
| Setting | Value |
|---|---|
| Algorithm | Byte Pair Encoding (BPE) |
| Vocabulary size | 32,000 |
| Min subword frequency | 2 |
| Training data | 24M unique Pashto sentences (PashtoCorp) |
| Normalisation | NFC Unicode |
| Pre-tokenisation | Whitespace |
| Special tokens | [PAD], [CLS], [SEP], [UNK], [MASK] |
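The settings above map directly onto the `tokenizers` training API. The exact training script is not published here, so treat this as an approximate sketch of the listed configuration (a few placeholder lines stand in for the 24M-sentence corpus, and `WhitespaceSplit` is assumed for the plain whitespace pre-tokenisation):

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Placeholder data standing in for the 24M-sentence PashtoCorp input
sentences = [
    "کابل د افغانستان پلازمینه ده",
    "پښتو ژبه د افغانستان مورنۍ ژبه ده",
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()
# WhitespaceSplit is a pure whitespace split, matching the table;
# the original training script may differ in this detail.
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    min_frequency=2,
    special_tokens=["[PAD]", "[CLS]", "[SEP]", "[UNK]", "[MASK]"],
)
tokenizer.train_from_iterator(sentences, trainer=trainer)
print(tokenizer.get_vocab_size())
```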
## Installation

```bash
pip install transformers tokenizers
```
## Usage — HuggingFace tokenizer

### Load with AutoTokenizer
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

text = "پښتو ژبه د افغانستان او پاکستان د ملیونونو خلکو مورنۍ ژبه ده"
tokens = tokenizer.tokenize(text)
print(tokens)
# ['پښتو', 'ژبه', 'د', 'افغانستان', 'او', 'پاکستان', 'د', 'ملیونونو', 'خلکو', 'مورنۍ', 'ژبه', 'ده']

encoded = tokenizer(text, return_tensors="pt")
print(encoded["input_ids"])
```
### Encode and decode
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

# Encode
ids = tokenizer.encode("د افغانستان خلک د سولې غوښتونکي دي")
print(ids)

# Decode back
text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)
```
### Batch encoding for model training
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

sentences = [
    "کابل د افغانستان پلازمینه ده",
    "د افغانستان خلک د سولې غوښتونکي دي",
]

batch = tokenizer(
    sentences,
    padding="max_length",  # pad every sequence to max_length
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([2, 128])
```

With `padding=True` instead, sequences are only padded to the longest in the batch, so the second dimension would be shorter than 128.
### Training a model from scratch with this tokenizer
```python
from transformers import AutoTokenizer, RobertaConfig, RobertaForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)
print(f"Parameters: {model.num_parameters():,}")
```
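A freshly initialised model can be smoke-tested end to end before wiring up a real data pipeline. A minimal sketch, assuming the same config as above with a hard-coded `vocab_size=32000` so no download is needed (labels are set equal to the random inputs purely to exercise the MLM loss):

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=32000,  # assumed; in practice use the tokenizer's vocab size
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config)

# Random token ids; labels = inputs purely to exercise the loss computation
input_ids = torch.randint(4, config.vocab_size, (2, 16))
outputs = model(input_ids=input_ids, labels=input_ids)
print(outputs.loss.item())  # roughly ln(32000) ≈ 10.4 at initialisation
```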
## Usage — SentencePiece
```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="ihanif/pashto-tokenizer",
    filename="sentencepiece/pashto_bpe_32k.model",
)

sp = spm.SentencePieceProcessor()
sp.Load(model_path)

text = "پښتو ژبه د افغانستان مورنۍ ژبه ده"
pieces = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)
print(pieces)
print(ids)

# Decode
print(sp.Decode(ids))
```
### SentencePiece with byte fallback (handles any character)
```python
# The model uses byte_fallback=True, so any OOV character
# is safely encoded as UTF-8 bytes rather than [UNK]
text_with_numbers = "د ۲۰۲۶ کال راپور"
print(sp.EncodeAsPieces(text_with_numbers))
```
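Conceptually, byte fallback decomposes any character outside the learned vocabulary into its UTF-8 bytes, each mapped to a dedicated byte token. A pure-Python illustration of the resulting piece shapes (the `<0x..>` spelling follows SentencePiece's byte-token convention):

```python
def byte_fallback_pieces(ch: str) -> list[str]:
    # An out-of-vocabulary character decomposes into one SentencePiece
    # byte token per UTF-8 byte, so no input ever maps to [UNK]
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]

print(byte_fallback_pieces("€"))
# ['<0xE2>', '<0x82>', '<0xAC>']
```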
## Measuring fertility on your own text
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

def fertility(text):
    words = text.split()
    tokens = tokenizer.tokenize(text)
    return len(tokens) / len(words) if words else 0

print(fertility("د افغانستان د ملت د سولې غوښتنه"))
# → ~1.25 tokens/word
```
## Related resources
- Corpus: ihanif/pashto-corpus — 1.25B-word Pashto corpus
- Encoder: ihanif/xlmr-pashto — XLM-R continued pre-training on PashtoCorp
- KenLM: ihanif/pashto-kenlm-5gram — 5-gram language model
- Code: ihanif/corpus_builder
## Citation

```bibtex
@misc{rahman2026pashtocorp,
  title        = {PashtoCorp: A 1.25B-Word Corpus, Evaluation Suite, and
                  Reproducible Pipeline for Low-Resource Language Development},
  author       = {Rahman, Hanif},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/ihanif/pashto-tokenizer}},
}
```