Nepali BPE Tokenizer

SentencePiece BPE tokenizers trained on a 7.49 GB cleaned Nepali corpus, optimized for Devanagari text.

Three vocab sizes included: 32K, 48K, and 64K.

Performance

Tokenizer        Vocab Size   Nepali tok/word
Nepali BPE 32K   32,000       1.34
Nepali BPE 48K   48,000       1.29
Nepali BPE 64K   64,000       1.26

For comparison, the English baseline is ~1.25 tokens/word, so these tokenizers bring Nepali close to English-level tokenization efficiency.
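The tok/word figures above are fertility ratios: total tokens emitted divided by total whitespace-delimited words. A minimal sketch of how such a figure can be computed, assuming only that `encode` is a callable returning a token list per sentence (e.g. `lambda s: sp.encode(s, out_type=str)` for a loaded SentencePiece model):

```python
def fertility(sentences, encode):
    """Average tokens per whitespace-delimited word over a corpus.

    encode: any callable mapping a sentence to a list of tokens.
    """
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_tokens += len(encode(sentence))
        total_words += len(sentence.split())
    return total_tokens / total_words
```

Lower is better: a fertility of 1.26 means a 100-word Nepali passage costs about 126 tokens of context.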

Training Corpus

Assembled from 4 sources, cleaned with Unicode NFC normalization, Devanagari ratio filtering (>50%), control character removal, and paragraph-level deduplication:

  • CulturaX Nepali — ~800M characters
  • Sangraha verified Nepali (AI4Bharat) — ~500M characters
  • CC-100 Nepali — filtered from local archive
  • Nepali books/documents — publicly available text

Total: 7.49GB, 2.83B characters, 18.7M lines.
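The cleaning steps above (NFC normalization, >50% Devanagari ratio filtering, control character removal, paragraph-level deduplication) can be sketched roughly as follows. This is an illustration using assumed helper names and the standard Devanagari Unicode block, not the exact pipeline:

```python
import unicodedata

DEVANAGARI = range(0x0900, 0x0980)  # Devanagari Unicode block

def devanagari_ratio(text):
    """Fraction of non-space characters that are Devanagari."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(ord(c) in DEVANAGARI for c in chars) / len(chars)

def clean_corpus(paragraphs, min_ratio=0.5):
    """NFC-normalize, strip control chars, filter by Devanagari ratio, dedup."""
    seen = set()
    out = []
    for p in paragraphs:
        p = unicodedata.normalize("NFC", p)
        # Drop control/format characters (Unicode general category C*)
        p = "".join(c for c in p if not unicodedata.category(c).startswith("C"))
        if devanagari_ratio(p) <= min_ratio:  # keep only >50% Devanagari
            continue
        if p in seen:  # paragraph-level deduplication
            continue
        seen.add(p)
        out.append(p)
    return out
```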

Usage

import sentencepiece as spm

# Load the 32K-vocab model (the 48K and 64K .model files load the same way)
sp = spm.SentencePieceProcessor()
sp.load("nepali_bpe_32k.model")

# Encode a Nepali sentence into subword token strings
tokens = sp.encode("नेपालको राजधानी काठमाडौं हो", out_type=str)
print(tokens)

Context

Built as part of a 17-model Nepali tokenizer benchmark. High-value tokens from the 32K model were used to extend production model tokenizers (Phi-4, Qwen 3.5, DeepSeek V4, Kimi K2.6), reducing their Nepali token counts by 37-52%.
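The extension step can be illustrated roughly: rank candidate tokens from the Nepali vocabulary and add the top scorers to a production tokenizer's vocabulary. The scoring below (corpus frequency × token length, i.e. characters covered per occurrence) is an assumed heuristic for illustration; the actual selection criteria used for the benchmark are not specified here:

```python
def select_high_value_tokens(token_freqs, existing_vocab, top_k=1000):
    """Rank candidate subword tokens by frequency * length, skipping
    tokens the target tokenizer already has.

    token_freqs: dict of token -> occurrence count in the Nepali corpus
    existing_vocab: set of tokens already in the production tokenizer
    """
    scores = {
        tok: freq * len(tok)  # chars covered per merge, weighted by frequency
        for tok, freq in token_freqs.items()
        if tok not in existing_vocab and len(tok) > 1
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Adding the selected tokens to a model's tokenizer also requires resizing and initializing the corresponding embedding rows, which is model-specific and omitted here.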
