Pashto BPE Tokenizer (32k)

A Pashto-native BPE tokenizer trained on 24 million unique sentences from the PashtoCorp corpus (1.25 billion words). Provided in two formats: a HuggingFace-native tokenizer (load with AutoTokenizer) and a SentencePiece model for broader compatibility.

Performance

| Metric | Value |
|---|---|
| Vocabulary size | 32,000 |
| Pashto BPE fertility | 1.2516 tokens/word |
| XLM-R fertility (Pashto) | 1.5018 tokens/word |
| Token reduction vs XLM-R | 16.67% fewer tokens |
| Training sentences | 24,035,371 |

Fertility measures average tokens per whitespace-split word — lower means more efficient encoding. At 1.2516 tokens/word, this tokenizer uses 16.67% fewer tokens than XLM-R's multilingual tokenizer on Pashto text, which directly reduces compute cost for any transformer model trained with it.
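The reduction figure follows directly from the two fertility numbers in the table; a quick sanity check:

```python
# Fertility values from the table above (tokens per whitespace-split word).
pashto_bpe_fertility = 1.2516
xlmr_fertility = 1.5018

# Relative token reduction: the fraction of tokens saved vs XLM-R
# when encoding the same Pashto text.
reduction = (xlmr_fertility - pashto_bpe_fertility) / xlmr_fertility
print(f"{reduction:.1%}")  # → 16.7% (the table's 16.67% reflects unrounded fertility values)
```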

Files

| File | Format | Use case |
|---|---|---|
| tokenizer.json | HuggingFace tokenizers | AutoTokenizer.from_pretrained() |
| tokenizer_config.json | HuggingFace config | Required for AutoTokenizer |
| special_tokens_map.json | HuggingFace config | Required for AutoTokenizer |
| sentencepiece/pashto_bpe_32k.model | SentencePiece model | MT, ASR, other pipelines |
| sentencepiece/pashto_bpe_32k.vocab | SentencePiece vocab | Human-readable vocabulary |

Training details

| Setting | Value |
|---|---|
| Algorithm | Byte Pair Encoding (BPE) |
| Vocabulary size | 32,000 |
| Min subword frequency | 2 |
| Training data | 24M unique Pashto sentences (PashtoCorp) |
| Normalisation | NFC Unicode |
| Pre-tokenisation | Whitespace |
| Special tokens | [PAD], [CLS], [SEP], [UNK], [MASK] |
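These settings map directly onto the HuggingFace `tokenizers` training API. The author's actual training script is not published here, so the following is a sketch of how a tokenizer with this configuration could be trained; the two-sentence corpus is a stand-in for the real 24M-sentence dataset:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Tiny in-memory stand-in for the real PashtoCorp training data.
corpus = [
    "کابل د افغانستان پلازمینه ده",
    "پښتو ژبه د افغانستان مورنۍ ژبه ده",
]

# BPE model with the normalisation and pre-tokenisation from the table.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,   # target size; a toy corpus yields far fewer entries
    min_frequency=2,
    special_tokens=["[PAD]", "[CLS]", "[SEP]", "[UNK]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer)
print(tokenizer.get_vocab_size())
```

On the real corpus you would pass file paths via `tokenizer.train([...], trainer)` and save the result with `tokenizer.save("tokenizer.json")`.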

Installation

pip install transformers tokenizers sentencepiece huggingface_hub

Usage — HuggingFace tokenizer

Load with AutoTokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

text = "پښتو ژبه د افغانستان او پاکستان د ملیونونو خلکو مورنۍ ژبه ده"
tokens = tokenizer.tokenize(text)
print(tokens)
# ['پښتو', 'ژبه', 'د', 'افغانستان', 'او', 'پاکستان', 'د', 'ملیونونو', 'خلکو', 'مورنۍ', 'ژبه', 'ده']

encoded = tokenizer(text, return_tensors="pt")
print(encoded["input_ids"])

Encode and decode

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

# Encode
ids = tokenizer.encode("د افغانستان خلک د سولې غوښتونکي دي")
print(ids)

# Decode back
text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)

Batch encoding for model training

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

sentences = [
    "کابل د افغانستان پلازمینه ده",
    "د افغانستان خلک د سولې غوښتونکي دي",
]

batch = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([2, N]); padding=True pads to the longest sequence in the batch, not to max_length
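Conceptually, `padding=True` pads every id sequence to the longest one in the batch and builds a matching attention mask. A minimal sketch (the `pad_batch` helper and `pad_id=0` are illustrative; the real pad id is `tokenizer.pad_token_id`):

```python
# Sketch of what batch padding does: extend each sequence to the batch
# maximum with the pad id and mark real tokens (1) vs padding (0).
def pad_batch(sequences, pad_id=0):
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        input_ids.append(seq + pad)
        attention_mask.append([1] * len(seq) + [0] * len(pad))
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [8, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```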

Training a model from scratch with this tokenizer

from transformers import AutoTokenizer, RobertaConfig, RobertaForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

config = RobertaConfig(
    vocab_size=len(tokenizer),  # len() includes added special tokens; .vocab_size may not
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.cls_token_id,
    eos_token_id=tokenizer.sep_token_id,
)

model = RobertaForMaskedLM(config=config)
print(f"Parameters: {model.num_parameters():,}")

Usage — SentencePiece

import sentencepiece as spm
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="ihanif/pashto-tokenizer",
    filename="sentencepiece/pashto_bpe_32k.model",
)

sp = spm.SentencePieceProcessor()
sp.Load(model_path)

text = "پښتو ژبه د افغانستان مورنۍ ژبه ده"
pieces = sp.EncodeAsPieces(text)
ids    = sp.EncodeAsIds(text)
print(pieces)
print(ids)

# Decode
print(sp.Decode(ids))

SentencePiece with byte fallback (handles any character)

# Continues from the previous snippet (`sp` already loaded).
# The model uses byte_fallback=True, so any out-of-vocabulary character
# is safely encoded as UTF-8 bytes rather than collapsing to [UNK]
text_with_numbers = "د ۲۰۲۶ کال راپور"
print(sp.EncodeAsPieces(text_with_numbers))

Measuring fertility on your own text

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ihanif/pashto-tokenizer")

def fertility(text):
    words  = text.split()
    tokens = tokenizer.tokenize(text)
    return len(tokens) / len(words) if words else 0

print(fertility("د افغانستان د ملت د سولې غوښتنه"))
# → ~1.25 tokens/word
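For a corpus-level figure it is usually better to aggregate token and word counts over all sentences than to average per-sentence ratios, since short sentences would otherwise be over-weighted. A generic version that accepts any tokenize function (shown with a character-splitting stand-in so it runs without downloading the tokenizer; in practice pass `tokenizer.tokenize`):

```python
def corpus_fertility(sentences, tokenize):
    """Total tokens divided by total whitespace words across the corpus."""
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words if total_words else 0.0

# Stand-in tokenizer (one token per character) just to illustrate the math.
sentences = ["ab cd", "ef gh ij"]
print(corpus_fertility(sentences, tokenize=lambda s: list(s.replace(" ", ""))))
# → 2.0  (10 tokens / 5 words)
```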


Citation

@misc{rahman2026pashtocorp,
  title        = {PashtoCorp: A 1.25B-Word Corpus, Evaluation Suite, and
                  Reproducible Pipeline for Low-Resource Language Development},
  author       = {Rahman, Hanif},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/ihanif/pashto-tokenizer}},
}