---
language:
- ne
- en
tags:
- nepali
- devanagari
- tokenizer
- tokenizer-extension
license: other
license_name: community-use-1.0
license_link: LICENSE.md
---
# Phi-4: Nepali Extended Tokenizer

An extended tokenizer for Phi-4 that adds roughly 15K high-value Nepali/Devanagari tokens to the base vocabulary.

## Token Efficiency

| Tokenizer | Nepali tokens per word |
|---|---:|
| Original Phi-4 | 7.10 |
| **Extended (this)** | **3.41** |
| **Reduction** | **51.9%** |
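
The tokens-per-word figure can be reproduced along these lines. This is a minimal sketch, assuming words are whitespace-separated and that `tokenize` is any callable returning a list of tokens (for example a tokenizer's `tokenize` method). The helper names are hypothetical, and the benchmark's exact word-splitting rule is not given here.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Rounded values from the table above.
print(f"{relative_reduction(7.10, 3.41):.3f}")
```

With the rounded table values this evaluates to about 0.520; the 51.9% in the table was presumably computed from unrounded measurements.
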

## How It Was Built

1. Trained a 32K-vocabulary SentencePiece BPE tokenizer on a 7.49 GB cleaned Nepali corpus.
2. Selected the tokens from that vocabulary that the Phi-4 base tokenizer splits into three or more subtokens (the delta vocabulary approach).
3. Added ~15K high-value Nepali tokens to the base tokenizer.
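
The selection in step 2 can be sketched as a filter over the new tokenizer's vocabulary. This is an illustrative sketch, not the project's actual script: `select_delta_tokens` and its parameters are hypothetical names, `base_tokenize` stands in for the Phi-4 tokenizer's `tokenize` method, and the candidates are assumed to arrive already ordered by priority (how "high-value" was scored is not described here).

```python
from typing import Callable, Iterable, List

SPM_WORD_BOUNDARY = "\u2581"  # SentencePiece's word-boundary marker


def select_delta_tokens(
    candidates: Iterable[str],
    base_tokenize: Callable[[str], List[str]],
    min_subtokens: int = 3,   # threshold from step 2
    limit: int = 15_000,      # approximate cap from step 3
) -> List[str]:
    """Keep candidates that the base tokenizer splits into many subtokens."""
    selected: List[str] = []
    seen = set()
    for piece in candidates:
        # Drop the boundary marker so the base tokenizer sees plain text.
        word = piece.replace(SPM_WORD_BOUNDARY, "")
        if not word or word in seen:
            continue
        if len(base_tokenize(word)) >= min_subtokens:
            selected.append(word)
            seen.add(word)
            if len(selected) >= limit:
                break
    return selected
```

With a Hugging Face tokenizer, the selected strings could then be added with `tokenizer.add_tokens(selected)`, though the exact mechanism used for this tokenizer is not stated here.
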
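
The extended tokenizer is a drop-in replacement for the original tokenizer. Because its vocabulary is larger, a model that uses it needs its embedding matrix resized to cover the new token IDs, and those new embeddings start out untrained. To use the new tokens effectively, the model therefore needs continued pretraining (CPT) on Nepali text; see the [Qwen3-4B Nepali model](https://huggingface.co/sidskarki/qwen3-4b-nepali) for a full CPT+SFT example.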
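
Before any continued pretraining, a model paired with this tokenizer needs room in its embedding matrix for the added token IDs. The following is a minimal sketch, assuming a Hugging Face `transformers` model and tokenizer are already loaded. `grow_embeddings` is a hypothetical helper name, and the base checkpoint ID in the usage comment is an assumption.

```python
def grow_embeddings(model, tokenizer) -> int:
    """Resize the model's embeddings to fit the tokenizer; return rows added."""
    old_size = model.get_input_embeddings().num_embeddings
    new_size = len(tokenizer)  # includes added tokens
    if new_size <= old_size:
        # The embedding matrix already covers every token ID; leave it as is.
        return 0
    # New rows are freshly initialized and untrained until CPT.
    model.resize_token_embeddings(new_size)
    return new_size - old_size


# Usage (assumed checkpoint ID; downloads the full model weights):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("sidskarki/phi4-nepali-tokenizer")
# model = AutoModelForCausalLM.from_pretrained("microsoft/phi-4")
# print(grow_embeddings(model, tokenizer))
```

The helper only ever grows the matrix, so padding rows in the base checkpoint are left untouched.
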

## Usage

```python
from transformers import AutoTokenizer

# Load the extended tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("sidskarki/phi4-nepali-tokenizer")

# Tokenize a Nepali sentence ("The capital of Nepal is Kathmandu").
tokens = tokenizer.tokenize("नेपालको राजधानी काठमाडौं हो")
print(tokens, len(tokens))
```

## Context

Part of a [17-model Nepali tokenizer benchmark](https://siddhantskarki.com/case-studies/nepali-tokenizer) measuring the Nepali token tax across modern LLM tokenizers.

## Links

- **Code:** [github.com/sidskarkii/nepali-tokenizer](https://github.com/sidskarkii/nepali-tokenizer)
- **Case study:** [siddhantskarki.com/case-studies/nepali-tokenizer](https://siddhantskarki.com/case-studies/nepali-tokenizer)