---
language:
- ne
- en
tags:
- nepali
- devanagari
- tokenizer
- tokenizer-extension
license: other
license_name: community-use-1.0
license_link: LICENSE.md
---
# Phi-4 — Nepali Extended Tokenizer
Extended tokenizer for Phi-4 with ~15K added high-value Nepali/Devanagari tokens.
## Token Efficiency
| Tokenizer | Nepali tokens per word |
|---|---:|
| Original Phi-4 | 7.10 |
| **Extended (this)** | **3.41** |
| **Reduction** | **51.9%** |
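The tokens-per-word figure in the table can be reproduced in spirit with a small helper. The following is a minimal sketch, assuming words are whitespace-delimited and that the metric is total tokens divided by total words over a set of texts; the card does not state the exact evaluation procedure. The names `tokens_per_word`, `reduction_percent`, and `char_tokenize` are hypothetical and used only for illustration.

```python
from typing import Callable, Iterable, List


def tokens_per_word(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed metric)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("input contains no words")
    return total_tokens / total_words


def reduction_percent(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Toy stand-in tokenizer: one token per non-space character.
# With a real tokenizer, pass `tokenizer.tokenize` instead.
def char_tokenize(text: str) -> List[str]:
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    sample = ["ab cd", "efg"]
    print(tokens_per_word(sample, char_tokenize))
    print(reduction_percent(4.0, 2.0))
```

To compare two real tokenizers, call `tokens_per_word` once per tokenizer on the same held-out Nepali text and pass the two averages to `reduction_percent`.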
## How It Was Built
1. Trained a 32K SentencePiece BPE tokenizer on a 7.49GB cleaned Nepali corpus
2. Selected tokens that the Phi-4 base tokenizer splits into 3+ subtokens (delta vocabulary approach)
3. Added ~15K high-value Nepali tokens to the base tokenizer
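The selection in step 2 can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not the author's actual script: `select_delta_tokens`, `toy_base_tokenize`, and the `limit` parameter are hypothetical, and the card does not say how candidates were ranked as "high-value", so this sketch simply keeps them in input order.

```python
from typing import Callable, Iterable, List, Optional


def select_delta_tokens(
    candidates: Iterable[str],
    base_tokenize: Callable[[str], List[str]],
    min_pieces: int = 3,
    limit: Optional[int] = None,
) -> List[str]:
    """Keep candidates that the base tokenizer splits into `min_pieces` or more subtokens."""
    selected: List[str] = []
    seen = set()
    for token in candidates:
        if token in seen:
            continue  # skip duplicate candidates
        seen.add(token)
        if len(base_tokenize(token)) >= min_pieces:
            selected.append(token)
            if limit is not None and len(selected) >= limit:
                break  # stop once the requested number of tokens is reached
    return selected


# Toy stand-in for the base tokenizer: splits text into 2-character chunks.
def toy_base_tokenize(text: str) -> List[str]:
    return [text[i:i + 2] for i in range(0, len(text), 2)]


if __name__ == "__main__":
    vocab = ["ab", "abcd", "abcde", "abcdefg", "abcde"]
    print(select_delta_tokens(vocab, toy_base_tokenize))
```

With real tokenizers, `candidates` would be the vocabulary of the newly trained SentencePiece model and `base_tokenize` would be the base tokenizer's `tokenize` method. One way to add the selected strings to a Transformers tokenizer is `tokenizer.add_tokens(selected)`, though the card does not confirm that this is the method used here.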
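The extended tokenizer is a drop-in replacement for the original at the tokenizer level. Because the vocabulary is larger, a model paired with it will typically need its embedding matrix resized to match (for example with `model.resize_token_embeddings(len(tokenizer))` in Transformers), and the embeddings for the new tokens start out untrained. To use the new tokens effectively, the model needs continued pretraining (CPT) on Nepali text (see the [Qwen3-4B Nepali model](https://huggingface.co/sidskarki/qwen3-4b-nepali) for a full CPT+SFT example).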
## Usage
```python
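from transformers import AutoTokenizer

# Load the extended tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sidskarki/phi4-nepali-tokenizer")

# Tokenize a Nepali sentence and print the pieces with their count
tokens = tokenizer.tokenize("नेपालको राजधानी काठमाडौं हो")
print(tokens, len(tokens))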
```
## Context
Part of a [17-model Nepali tokenizer benchmark](https://siddhantskarki.com/case-studies/nepali-tokenizer) measuring the Nepali token tax across modern LLM tokenizers.
## Links
- **Code:** [github.com/sidskarkii/nepali-tokenizer](https://github.com/sidskarkii/nepali-tokenizer)
- **Case study:** [siddhantskarki.com/case-studies/nepali-tokenizer](https://siddhantskarki.com/case-studies/nepali-tokenizer)