# Phi-4 — Nepali Extended Tokenizer
Extended tokenizer for Phi-4 with ~15K added high-value Nepali/Devanagari tokens.
## Token Efficiency

| Tokenizer | Nepali tokens/word |
|---|---|
| Original Phi-4 | 7.10 |
| Extended (this) | 3.41 |
| **Reduction** | **51.9%** |
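The reduction figure follows directly from the two tokens-per-word rates. A minimal sketch of the arithmetic (function names are illustrative, not from this repo):

```python
def tokens_per_word(token_count: int, word_count: int) -> float:
    """Average number of tokens emitted per whitespace-separated word."""
    return token_count / word_count

def reduction_pct(baseline: float, extended: float) -> float:
    """Relative reduction in tokens/word, as a percentage."""
    return (baseline - extended) / baseline * 100

print(reduction_pct(7.10, 3.41))  # ~51.97, consistent with the ~51.9% in the table
```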
## How It Was Built

- Trained a 32K SentencePiece BPE tokenizer on a 7.49 GB cleaned Nepali corpus
- Selected tokens that the Phi-4 base tokenizer splits into 3+ subtokens (delta vocabulary approach)
- Added ~15K high-value Nepali tokens to the base tokenizer
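The delta-vocabulary selection step above can be sketched as follows. `base_tokenize` stands in for the Phi-4 base tokenizer, and the threshold of 3 subtokens matches the description; names and details are illustrative, not the repo's actual code:

```python
def select_delta_tokens(candidates, base_tokenize, min_subtokens=3):
    """Keep candidate tokens that the base tokenizer fragments badly.

    candidates: tokens from the 32K Nepali SentencePiece model
    base_tokenize: function mapping a string to the base tokenizer's subtokens
    """
    return [tok for tok in candidates
            if len(base_tokenize(tok)) >= min_subtokens]

# Toy stand-in for the base tokenizer: it only knows single characters,
# so every candidate splits into one subtoken per character.
char_tokenize = lambda s: list(s)

selected = select_delta_tokens(["नेपाल", "हो"], char_tokenize)
print(selected)  # ["नेपाल"]; "हो" splits into only 2 subtokens, below the threshold
```

In the real pipeline, tokens passing this filter would be appended to the base vocabulary (e.g. via `tokenizer.add_tokens` in `transformers`).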
The extended tokenizer is a drop-in replacement for the original. To make effective use of the new tokens, however, the model needs its embeddings resized and continued pretraining on Nepali text (see the Qwen3-4B Nepali model for a full CPT+SFT example).
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sidskarki/phi4-nepali-tokenizer")
tokens = tokenizer.tokenize("नेपालको राजधानी काठमाडौं हो")
print(tokens, len(tokens))
```
## Context
Part of a 17-model Nepali tokenizer benchmark measuring the "Nepali token tax" (the extra tokens Devanagari text costs per word) across modern LLM tokenizers.