# Pashto BPE Tokenizer (32k)

A Pashto-native BPE tokenizer trained on 24 million unique sentences from the PashtoCorp corpus (1.25 billion words). Provided in two formats: a HuggingFace-native tokenizer (load with `AutoTokenizer`) and a SentencePiece model for broader compatibility.
## Performance
| Metric | Value |
|---|---|
| Vocabulary size | 32,000 |
| Pashto BPE fertility | 1.2516 tokens/word |
| XLM-R fertility (Pashto) | 1.5018 tokens/word |
| Token reduction vs XLM-R | 16.67% fewer tokens |
| Training sentences | 24,035,371 |
Fertility is the average number of tokens per whitespace-split word; lower means more efficient encoding. At 1.2516 tokens/word, this tokenizer uses 16.67% fewer tokens than XLM-R's multilingual tokenizer on Pashto text, which directly reduces compute cost for any transformer model trained with it.
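The reduction figure in the table follows directly from the two fertility values; a quick check of the arithmetic:

```python
pashto_fertility = 1.2516  # tokens/word for this tokenizer
xlmr_fertility = 1.5018    # tokens/word for XLM-R on the same Pashto text

# Relative reduction in token count when switching tokenizers
reduction = (xlmr_fertility - pashto_fertility) / xlmr_fertility
print(f"{reduction:.1%}")  # 16.7%
```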
## Files
| File | Format | Use case |
|---|---|---|
| `tokenizer.json` | HuggingFace tokenizers | `AutoTokenizer.from_pretrained()` |
| `tokenizer_config.json` | HuggingFace config | Required for `AutoTokenizer` |
| `special_tokens_map.json` | HuggingFace config | Required for `AutoTokenizer` |
| `sentencepiece/pashto_bpe_32k.model` | SentencePiece | MT, ASR, other pipelines |
| `sentencepiece/pashto_bpe_32k.vocab` | SentencePiece vocab | Human-readable vocabulary |
## Training details
| Setting | Value |
|---|---|
| Algorithm | Byte Pair Encoding (BPE) |
| Vocabulary size | 32,000 |
| Min subword frequency | 2 |
| Training data | 24M unique Pashto sentences (PashtoCorp) |
| Normalisation | NFC Unicode |
| Pre-tokenisation | Whitespace |
| Special tokens | [PAD], [CLS], [SEP], [UNK], [MASK] |
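The settings above map directly onto the `tokenizers` training API. The exact training script is not published here, so treat this as an approximate sketch of the listed configuration (a few placeholder lines stand in for the 24M-sentence corpus, and `WhitespaceSplit` is assumed for the plain whitespace pre-tokenisation):

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Placeholder data standing in for the 24M-sentence PashtoCorp input
sentences = [
    "کابل د افغانستان پلازمینه ده",
    "پښتو ژبه د افغانستان مورنۍ ژبه ده",
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()
# WhitespaceSplit is a pure whitespace split, matching the table;
# the original training script may differ in this detail.
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    min_frequency=2,
    special_tokens=["[PAD]", "[CLS]", "[SEP]", "[UNK]", "[MASK]"],
)
tokenizer.train_from_iterator(sentences, trainer=trainer)
print(tokenizer.get_vocab_size())
```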
## Installation

```bash
pip install transformers tokenizers
```
## Usage — HuggingFace tokenizer

### Load with AutoTokenizer
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

text = "پښتو ژبه د افغانستان او پاکستان د ملیونونو خلکو مورنۍ ژبه ده"
tokens = tokenizer.tokenize(text)
print(tokens)
# ['پښتو', 'ژبه', 'د', 'افغانستان', 'او', 'پاکستان', 'د', 'ملیونونو', 'خلکو', 'مورنۍ', 'ژبه', 'ده']

encoded = tokenizer(text, return_tensors="pt")
print(encoded["input_ids"])
```
### Encode and decode
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

# Encode
ids = tokenizer.encode("د افغانستان خلک د سولې غوښتونکي دي")
print(ids)

# Decode back
text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)
```
### Batch encoding for model training
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

sentences = [
    "کابل د افغانستان پلازمینه ده",
    "د افغانستان خلک د سولې غوښتونکي دي",
]

batch = tokenizer(
    sentences,
    padding="max_length",  # pad every sequence to max_length
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([2, 128])
```

With `padding=True` instead, sequences are only padded to the longest in the batch, so the second dimension would be shorter than 128.
### Training a model from scratch with this tokenizer
```python
from transformers import AutoTokenizer, RobertaConfig, RobertaForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)
print(f"Parameters: {model.num_parameters():,}")
```
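A freshly initialised model can be smoke-tested end to end before wiring up a real data pipeline. A minimal sketch, assuming the same config as above with a hard-coded `vocab_size=32000` so no download is needed (labels are set equal to the random inputs purely to exercise the MLM loss):

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=32000,  # assumed; in practice use the tokenizer's vocab size
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config)

# Random token ids; labels = inputs purely to exercise the loss computation
input_ids = torch.randint(4, config.vocab_size, (2, 16))
outputs = model(input_ids=input_ids, labels=input_ids)
print(outputs.loss.item())  # roughly ln(32000) ≈ 10.4 at initialisation
```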
## Usage — SentencePiece
```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="ihanif/pashto-tokenizer",
    filename="sentencepiece/pashto_bpe_32k.model",
)

sp = spm.SentencePieceProcessor()
sp.Load(model_path)

text = "پښتو ژبه د افغانستان مورنۍ ژبه ده"
pieces = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)
print(pieces)
print(ids)

# Decode
print(sp.Decode(ids))
```
### SentencePiece with byte fallback (handles any character)
```python
# The model uses byte_fallback=True, so any OOV character
# is safely encoded as UTF-8 bytes rather than [UNK]
text_with_numbers = "د ۲۰۲۶ کال راپور"
print(sp.EncodeAsPieces(text_with_numbers))
```
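Conceptually, byte fallback decomposes any character outside the learned vocabulary into its UTF-8 bytes, each mapped to a dedicated byte token. A pure-Python illustration of the resulting piece shapes (the `<0x..>` spelling follows SentencePiece's byte-token convention):

```python
def byte_fallback_pieces(ch: str) -> list[str]:
    # An out-of-vocabulary character decomposes into one SentencePiece
    # byte token per UTF-8 byte, so no input ever maps to [UNK]
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]

print(byte_fallback_pieces("€"))
# ['<0xE2>', '<0x82>', '<0xAC>']
```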
## Measuring fertility on your own text
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

def fertility(text):
    words = text.split()
    tokens = tokenizer.tokenize(text)
    return len(tokens) / len(words) if words else 0

print(fertility("د افغانستان د ملت د سولې غوښتنه"))
# → ~1.25 tokens/word
```
## Related resources
- Corpus: ihanif/pashto-corpus — 1.25B-word Pashto corpus
- Encoder: ihanif/xlmr-pashto — XLM-R continued pre-training on PashtoCorp
- KenLM: ihanif/pashto-kenlm-5gram — 5-gram language model
- Code: ihanif/corpus_builder
## Citation

```bibtex
@misc{rahman2026pashtocorp,
  title        = {PashtoCorp: A 1.25B-Word Corpus, Evaluation Suite, and
                  Reproducible Pipeline for Low-Resource Language Development},
  author       = {Rahman, Hanif},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/ihanif/pashto-tokenizer}},
}
```