# Nepali BPE Tokenizer
SentencePiece BPE tokenizers trained on a 7.49 GB cleaned Nepali corpus for Devanagari-optimized tokenization. Three vocabulary sizes are included: 32K, 48K, and 64K.
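A minimal training sketch with the SentencePiece Python API is shown below; the input path, `character_coverage`, and other flags are assumptions for illustration, not the exact settings used for the released models.

```python
import sentencepiece as spm

# Hypothetical invocation: the input path, character_coverage, and other
# flags are illustrative, not the exact settings used for these models.
spm.SentencePieceTrainer.train(
    input="nepali_corpus_cleaned.txt",  # one paragraph per line
    model_prefix="nepali_bpe_32k",      # writes .model and .vocab files
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,          # common choice for non-Latin scripts
)
```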
## Performance
| Tokenizer | Vocab size | Tokens per word (Nepali) |
|---|---|---|
| Nepali BPE 32K | 32,000 | 1.34 |
| Nepali BPE 48K | 48,000 | 1.29 |
| Nepali BPE 64K | 64,000 | 1.26 |
For comparison, a typical English baseline is ~1.25 tokens per word; these tokenizers bring Nepali close to English-level efficiency.
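The tokens-per-word figure can be reproduced as a simple ratio over an evaluation text; the whitespace word split and sample sentence below are assumptions, since the exact evaluation corpus is not specified here.

```python
import sentencepiece as spm

def tokens_per_word(sp: spm.SentencePieceProcessor, text: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    words = text.split()
    pieces = sp.encode(text, out_type=str)
    return len(pieces) / len(words)

sp = spm.SentencePieceProcessor()
sp.load("nepali_bpe_32k.model")
print(tokens_per_word(sp, "नेपालको राजधानी काठमाडौं हो"))
```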
## Training Corpus
Assembled from four sources and cleaned with Unicode NFC normalization, Devanagari-ratio filtering (>50%), control-character removal, and paragraph-level deduplication (a sketch of these steps follows the list):
- CulturaX Nepali — ~800M characters
- Sangraha verified Nepali (AI4Bharat) — ~500M characters
- CC-100 Nepali — filtered from local archive
- Nepali books/documents — publicly available text
Total: 7.49 GB, 2.83B characters, 18.7M lines.
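The sketch below applies the four cleaning steps in order. The regex ranges, the exact-match deduplication, and the in-memory `seen` set are assumptions about the implementation, not the pipeline's actual code.

```python
import re
import unicodedata

DEVANAGARI = re.compile(r"[\u0900-\u097F]")
CONTROL_CHARS = re.compile(r"[\u0000-\u0008\u000B-\u001F\u007F]")

def clean_paragraphs(paragraphs):
    """Yield cleaned, deduplicated paragraphs (sketch; details beyond the
    four named steps are assumptions)."""
    seen = set()
    for para in paragraphs:
        para = unicodedata.normalize("NFC", para)   # Unicode NFC normalization
        para = CONTROL_CHARS.sub("", para)          # control character removal
        chars = [c for c in para if not c.isspace()]
        if not chars:
            continue
        ratio = sum(1 for c in chars if DEVANAGARI.match(c)) / len(chars)
        if ratio <= 0.5:                            # Devanagari ratio filter (>50%)
            continue
        if para in seen:                            # paragraph-level deduplication
            continue
        seen.add(para)
        yield para
```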
## Usage
```python
import sentencepiece as spm

# Load one of the released models (32K shown; 48K and 64K work the same way).
sp = spm.SentencePieceProcessor()
sp.load("nepali_bpe_32k.model")

# Encode Nepali text into subword pieces.
tokens = sp.encode("नेपालको राजधानी काठमाडौं हो", out_type=str)
print(tokens)
```
## Context
Built as part of a 17-model Nepali tokenizer benchmark. High-value tokens from the 32K model were used to extend production model tokenizers (Phi-4, Qwen 3.5, DeepSeek V4, Kimi K2.6), reducing their Nepali token counts by 37-52%.
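One simple way to perform such an extension with Hugging Face `transformers` is sketched below. The base model name, the token list, and the use of `add_tokens` (rather than, say, editing the BPE merge table directly) are assumptions, not the procedure actually used in the benchmark.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder base model; the benchmark extended Phi-4, Qwen, DeepSeek,
# and Kimi tokenizers, but this name is purely illustrative.
base = "org/base-model"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical "high-value" pieces taken from the Nepali BPE 32K vocabulary.
high_value_tokens = ["नेपाल", "काठमाडौं", "सरकार"]
num_added = tokenizer.add_tokens(high_value_tokens)

# The embedding matrix needs rows for the new token IDs.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} Nepali tokens")
```

Note that `add_tokens` registers whole-string added tokens rather than new BPE merge rules; extending the merge table itself is another option the benchmark may have used instead.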