# Phi-4 — Nepali Extended Tokenizer
Extended tokenizer for Phi-4 with ~15K added high-value Nepali/Devanagari tokens.
## Token Efficiency

| Tokenizer | Nepali tokens/word |
|---|---|
| Original Phi-4 | 7.10 |
| Extended (this) | 3.41 |
| **Reduction** | **51.9%** |
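The reduction figure follows directly from the two tokens-per-word rates. A minimal sketch of the arithmetic (function names are illustrative, not from this repo):

```python
def tokens_per_word(token_count: int, word_count: int) -> float:
    """Average number of tokens emitted per whitespace-separated word."""
    return token_count / word_count

def reduction_pct(baseline: float, extended: float) -> float:
    """Relative reduction in tokens/word, as a percentage."""
    return (baseline - extended) / baseline * 100

print(reduction_pct(7.10, 3.41))  # ~51.97, consistent with the ~51.9% in the table
```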
## How It Was Built

- Trained a 32K SentencePiece BPE tokenizer on a 7.49 GB cleaned Nepali corpus
- Selected tokens that the Phi-4 base tokenizer splits into 3+ subtokens (delta vocabulary approach)
- Added ~15K high-value Nepali tokens to the base tokenizer
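The delta-vocabulary selection step above can be sketched as follows. `base_tokenize` stands in for the Phi-4 base tokenizer, and the threshold of 3 subtokens matches the description; names and details are illustrative, not the repo's actual code:

```python
def select_delta_tokens(candidates, base_tokenize, min_subtokens=3):
    """Keep candidate tokens that the base tokenizer fragments badly.

    candidates: tokens from the 32K Nepali SentencePiece model
    base_tokenize: function mapping a string to the base tokenizer's subtokens
    """
    return [tok for tok in candidates
            if len(base_tokenize(tok)) >= min_subtokens]

# Toy stand-in for the base tokenizer: it only knows single characters,
# so every candidate splits into one subtoken per character.
char_tokenize = lambda s: list(s)

selected = select_delta_tokens(["नेपाल", "हो"], char_tokenize)
print(selected)  # ["नेपाल"]; "हो" splits into only 2 subtokens, below the threshold
```

In the real pipeline, tokens passing this filter would be appended to the base vocabulary (e.g. via `tokenizer.add_tokens` in `transformers`).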
The extended tokenizer is a drop-in replacement for the original. To make effective use of the new tokens, however, the model needs its embeddings resized and continued pretraining on Nepali text (see the Qwen3-4B Nepali model for a full CPT+SFT example).
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sidskarki/phi4-nepali-tokenizer")
tokens = tokenizer.tokenize("नेपालको राजधानी काठमाडौं हो")
print(tokens, len(tokens))
```
## Context
Part of a 17-model Nepali tokenizer benchmark measuring the "Nepali token tax" (the extra tokens Devanagari text costs per word) across modern LLM tokenizers.