Nepali BPE Tokenizer

SentencePiece BPE tokenizers trained on a 7.49 GB cleaned Nepali corpus, optimized for Devanagari text.

Three vocab sizes included: 32K, 48K, and 64K.

Performance

Tokenizer        Vocab Size   Nepali tok/word
Nepali BPE 32K   32,000       1.34
Nepali BPE 48K   48,000       1.29
Nepali BPE 64K   64,000       1.26

For comparison, the English baseline is ~1.25 tokens/word, so these tokenizers bring Nepali close to English-level tokenization efficiency.
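The tok/word figures above are fertility ratios: total tokens emitted divided by total whitespace-delimited words. A minimal sketch of how such a figure can be computed, assuming only that `encode` is a callable returning a token list per sentence (e.g. `lambda s: sp.encode(s, out_type=str)` for a loaded SentencePiece model):

```python
def fertility(sentences, encode):
    """Average tokens per whitespace-delimited word over a corpus.

    encode: any callable mapping a sentence to a list of tokens.
    """
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_tokens += len(encode(sentence))
        total_words += len(sentence.split())
    return total_tokens / total_words
```

Lower is better: a fertility of 1.26 means a 100-word Nepali passage costs about 126 tokens of context.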

Training Corpus

Assembled from 4 sources, cleaned with Unicode NFC normalization, Devanagari ratio filtering (>50%), control character removal, and paragraph-level deduplication:

  • CulturaX Nepali — ~800M characters
  • Sangraha verified Nepali (AI4Bharat) — ~500M characters
  • CC-100 Nepali — filtered from local archive
  • Nepali books/documents — publicly available text

Total: 7.49GB, 2.83B characters, 18.7M lines.
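The cleaning steps above (NFC normalization, >50% Devanagari ratio filtering, control character removal, paragraph-level deduplication) can be sketched roughly as follows. This is an illustration using assumed helper names and the standard Devanagari Unicode block, not the exact pipeline:

```python
import unicodedata

DEVANAGARI = range(0x0900, 0x0980)  # Devanagari Unicode block

def devanagari_ratio(text):
    """Fraction of non-space characters that are Devanagari."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(ord(c) in DEVANAGARI for c in chars) / len(chars)

def clean_corpus(paragraphs, min_ratio=0.5):
    """NFC-normalize, strip control chars, filter by Devanagari ratio, dedup."""
    seen = set()
    out = []
    for p in paragraphs:
        p = unicodedata.normalize("NFC", p)
        # Drop control/format characters (Unicode general category C*)
        p = "".join(c for c in p if not unicodedata.category(c).startswith("C"))
        if devanagari_ratio(p) <= min_ratio:  # keep only >50% Devanagari
            continue
        if p in seen:  # paragraph-level deduplication
            continue
        seen.add(p)
        out.append(p)
    return out
```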

Usage

import sentencepiece as spm

# Load the 32K-vocab model (the 48K and 64K .model files load the same way)
sp = spm.SentencePieceProcessor()
sp.load("nepali_bpe_32k.model")

# Encode a Nepali sentence into subword token strings
tokens = sp.encode("नेपालको राजधानी काठमाडौं हो", out_type=str)
print(tokens)

Context

Built as part of a 17-model Nepali tokenizer benchmark. High-value tokens from the 32K model were used to extend production model tokenizers (Phi-4, Qwen 3.5, DeepSeek V4, Kimi K2.6), reducing their Nepali token counts by 37-52%.
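The extension step can be illustrated roughly: rank candidate tokens from the Nepali vocabulary and add the top scorers to a production tokenizer's vocabulary. The scoring below (corpus frequency × token length, i.e. characters covered per occurrence) is an assumed heuristic for illustration; the actual selection criteria used for the benchmark are not specified here:

```python
def select_high_value_tokens(token_freqs, existing_vocab, top_k=1000):
    """Rank candidate subword tokens by frequency * length, skipping
    tokens the target tokenizer already has.

    token_freqs: dict of token -> occurrence count in the Nepali corpus
    existing_vocab: set of tokens already in the production tokenizer
    """
    scores = {
        tok: freq * len(tok)  # chars covered per merge, weighted by frequency
        for tok, freq in token_freqs.items()
        if tok not in existing_vocab and len(tok) > 1
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Adding the selected tokens to a model's tokenizer also requires resizing and initializing the corresponding embedding rows, which is model-specific and omitted here.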
