Phi-4 — Nepali Extended Tokenizer

Extended tokenizer for Phi-4 with ~15K added high-value Nepali/Devanagari tokens.

Token Efficiency

Tokenizer          Nepali tok/word
Original Phi-4     7.10
Extended (this)    3.41
Reduction          51.9%
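
A tok/word figure like the ones above is presumably the total token count divided by the total word count over an evaluation text. The following is a minimal sketch of that calculation, assuming words are split on whitespace; the benchmark's actual word segmentation and evaluation corpus are not stated here. The names `tokens_per_word`, `reduction_percent`, and the toy `char_tokenize` are hypothetical helpers, not part of this repository.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average tokens per whitespace-separated word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def reduction_percent(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Toy stand-in tokenizer for illustration: one token per non-space character.
def char_tokenize(text: str) -> List[str]:
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    print(tokens_per_word(char_tokenize, ["ab cd", "efg"]))
    print(round(reduction_percent(7.10, 3.41), 2))
```

With a Hugging Face tokenizer, `tokenizer.tokenize` can be passed as the `tokenize` argument. Applied to the rounded table values, `reduction_percent(7.10, 3.41)` gives roughly 52.0%, so the reported 51.9% was presumably computed from unrounded ratios.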

How It Was Built

  1. Trained a 32K SentencePiece BPE tokenizer on a 7.49 GB cleaned Nepali corpus
  2. Selected tokens that the Phi-4 base tokenizer splits into 3+ subtokens (delta vocabulary approach)
  3. Added ~15K high-value Nepali tokens to the base tokenizer
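
The selection in step 2 can be sketched as a filter over candidate tokens. This is an illustration under stated assumptions, not the author's actual script: `select_delta_tokens` and `toy_base_tokenize` are hypothetical names, candidates are assumed to arrive in priority order (for example by frequency), and the cap of 15,000 simply mirrors the "~15K" figure above.

```python
from typing import Callable, Iterable, List


def select_delta_tokens(
    candidates: Iterable[str],
    base_tokenize: Callable[[str], List[str]],
    min_pieces: int = 3,
    max_new_tokens: int = 15_000,
) -> List[str]:
    """Keep candidates the base tokenizer splits into `min_pieces` or more subtokens.

    Candidates are assumed to be ordered by priority, so the first
    `max_new_tokens` qualifying ones are kept. Duplicates are skipped.
    """
    selected: List[str] = []
    seen = set()
    for token in candidates:
        if token in seen:
            continue
        seen.add(token)
        if len(base_tokenize(token)) >= min_pieces:
            selected.append(token)
            if len(selected) >= max_new_tokens:
                break
    return selected


# Toy base tokenizer for illustration only: greedy longest match against a
# tiny known vocabulary, falling back to single characters.
def toy_base_tokenize(text: str) -> List[str]:
    known = {"ab", "cd", "xyz"}
    pieces: List[str] = []
    i = 0
    while i < len(text):
        for size in (3, 2):
            piece = text[i:i + size]
            if piece in known:
                pieces.append(piece)
                i += len(piece)
                break
        else:
            pieces.append(text[i])
            i += 1
    return pieces


if __name__ == "__main__":
    candidates = ["abcd", "xyz", "hello", "abcde", "hello"]
    print(select_delta_tokens(candidates, toy_base_tokenize))
```

In a real run, `candidates` would be the pieces of the new 32K SentencePiece vocabulary and `base_tokenize` the `tokenize` method of the Phi-4 base tokenizer. One possible way to register the selected tokens is the Transformers `tokenizer.add_tokens(...)` method; whether that exact call was used for this tokenizer is an assumption.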

The extended tokenizer is a drop-in replacement for the original Phi-4 tokenizer. To use the new tokens effectively, the model's embedding matrix must be resized to cover the added token IDs, and the model needs continued pretraining on Nepali text (see the Qwen3-4B Nepali model for a full CPT+SFT example).

Usage

from transformers import AutoTokenizer

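# Load the extended tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sidskarki/phi4-nepali-tokenizer")
# Tokenize a Nepali sentence ("The capital of Nepal is Kathmandu") and count the tokens
tokens = tokenizer.tokenize("नेपालको राजधानी काठमाडौं हो")
print(tokens, len(tokens))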

Context

This tokenizer is part of a 17-model Nepali tokenizer benchmark that measures the Nepali "token tax", the extra tokens Nepali text costs, across modern LLM tokenizers.
