---
language:
- ne
- en
tags:
- nepali
- devanagari
- tokenizer
- tokenizer-extension
license: other
license_name: community-use-1.0
license_link: LICENSE.md
---

# Phi-4: Nepali Extended Tokenizer

Extended tokenizer for Phi-4 with ~15K added high-value Nepali/Devanagari tokens.

## Token Efficiency

| Tokenizer | Nepali tokens per word |
|---|---:|
| Original Phi-4 | 7.10 |
| **Extended (this)** | **3.41** |
| **Reduction** | **51.9%** |

## How It Was Built

1. Trained a 32K SentencePiece BPE tokenizer on a 7.49 GB cleaned Nepali corpus.
2. Selected the tokens that the Phi-4 base tokenizer splits into 3 or more subtokens (the delta vocabulary approach).
3. Added the resulting ~15K high-value Nepali tokens to the base tokenizer.

The extended tokenizer loads as a drop-in replacement for the original. The added tokens have no trained embeddings in the base model, so the model needs continued pretraining on Nepali text to use them effectively (see the [Qwen3-4B Nepali model](https://huggingface.co/sidskarki/qwen3-4b-nepali) for a full CPT+SFT example).

## Usage

```python
from transformers import AutoTokenizer

# Load the extended tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sidskarki/phi4-nepali-tokenizer")

# Tokenize a Nepali sentence and print the tokens with their count
tokens = tokenizer.tokenize("नेपालको राजधानी काठमाडौं हो")
print(tokens, len(tokens))
```

## Context

Part of a [17-model Nepali tokenizer benchmark](https://siddhantskarki.com/case-studies/nepali-tokenizer) measuring the Nepali token tax across modern LLM tokenizers.

## Links

- **Code:** [github.com/sidskarkii/nepali-tokenizer](https://github.com/sidskarkii/nepali-tokenizer)
- **Case study:** [siddhantskarki.com/case-studies/nepali-tokenizer](https://siddhantskarki.com/case-studies/nepali-tokenizer)
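
## Sketch: Delta Vocabulary Selection

The selection step under How It Was Built and the tokens-per-word metric in the Token Efficiency table can be illustrated with a minimal sketch. It is not the code used to build this tokenizer. The function names `tokens_per_word` and `select_delta_tokens`, the `min_pieces` and `limit` parameters, and the whitespace-based word count are all assumptions made for illustration. The tokenizer is passed in as a plain callable, so the sketch runs without downloading anything.

```python
from typing import Callable, Iterable, List, Optional

# Any function that maps a string to a list of token strings.
Tokenize = Callable[[str], List[str]]

SPM_WORD_MARKER = "\u2581"  # the "▁" prefix SentencePiece puts on word-initial pieces


def tokens_per_word(tokenize: Tokenize, sentences: Iterable[str]) -> float:
    """Average tokens per whitespace-separated word (assumed metric definition)."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_tokens += len(tokenize(sentence))
        total_words += len(sentence.split())
    if total_words == 0:
        raise ValueError("no words found in the input sentences")
    return total_tokens / total_words


def select_delta_tokens(
    candidates: Iterable[str],
    base_tokenize: Tokenize,
    min_pieces: int = 3,
    limit: Optional[int] = None,
) -> List[str]:
    """Keep candidates that the base tokenizer splits into min_pieces or more subtokens.

    Candidates are kept in input order, so passing them sorted by frequency
    and setting `limit` keeps the most frequent ones (hypothetical ranking).
    """
    selected: List[str] = []
    seen = set()
    for piece in candidates:
        text = piece.replace(SPM_WORD_MARKER, "").strip()
        if not text or text in seen:
            continue  # skip empty pieces and duplicates
        seen.add(text)
        if len(base_tokenize(text)) >= min_pieces:
            selected.append(text)
            if limit is not None and len(selected) >= limit:
                break
    return selected


# Toy demonstration: a character-level splitter stands in for the base tokenizer.
toy_base_tokenize: Tokenize = list
print(select_delta_tokens(["ab", "abcd", "\u2581xyz"], toy_base_tokenize))
print(tokens_per_word(toy_base_tokenize, ["ab cd"]))
```

With real tokenizers, `tokenizer.tokenize` from a `transformers` tokenizer can be passed as `base_tokenize`, and the selected strings could then be registered with `tokenizer.add_tokens(selected)`. Whether the published tokenizer was extended that way is not stated here.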