---
language:
- ne
- en
tags:
- nepali
- devanagari
- tokenizer
- tokenizer-extension
license: other
license_name: community-use-1.0
license_link: LICENSE.md
---

# Phi-4 — Nepali Extended Tokenizer

Extended tokenizer for Phi-4 with ~15K added high-value Nepali/Devanagari tokens.

## Token Efficiency

| Tokenizer | Nepali tokens per word |
|---|---:|
| Original Phi-4 | 7.10 |
| **Extended (this)** | **3.41** |
| **Reduction** | **51.9%** |
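
A tokens-per-word figure like the ones above can be measured roughly as follows. This is a minimal sketch, not the benchmark's actual script: it assumes words are whitespace-separated, and that `tokenize` is any callable returning a list of tokens (for example `tokenizer.tokenize` from `transformers`). The helper names `tokens_per_word` and `reduction_pct` are illustrative.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-separated word (assumed word definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def reduction_pct(base: float, extended: float) -> float:
    """Relative drop in tokens per word, as a percentage of the base value."""
    return (base - extended) / base * 100.0
```

Calling `tokens_per_word` once with the original Phi-4 tokenizer and once with the extended one, over the same evaluation sentences, yields the two values that `reduction_pct` compares. The benchmark's exact word-splitting rule and evaluation set may differ from this sketch.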

## How It Was Built

1. Trained a 32K-vocabulary SentencePiece BPE tokenizer on a 7.49 GB cleaned Nepali corpus
2. Selected tokens that the Phi-4 base tokenizer splits into 3+ subtokens (delta vocabulary approach)
3. Added ~15K high-value Nepali tokens to the base tokenizer
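
The selection in step 2 can be sketched as below. This is an illustration under stated assumptions, not the project's actual code: `candidates` is assumed to be the new SentencePiece vocabulary in priority order, `base_tokenize` is any callable that returns the base tokenizer's subtokens for a string, and the name `select_delta_tokens` and the `limit` parameter are hypothetical. Stripping the SentencePiece `▁` marker is a simplification that discards leading-space information.

```python
from typing import Callable, Iterable, List, Optional

SPM_SPACE = "\u2581"  # SentencePiece word-boundary marker


def select_delta_tokens(
    candidates: Iterable[str],
    base_tokenize: Callable[[str], List[str]],
    min_pieces: int = 3,
    limit: Optional[int] = None,
) -> List[str]:
    """Keep candidates that the base tokenizer splits into `min_pieces` or more subtokens."""
    selected: List[str] = []
    seen = set()
    for piece in candidates:
        surface = piece.replace(SPM_SPACE, "")  # simplification: drop the boundary marker
        if not surface or surface in seen:
            continue
        seen.add(surface)
        if len(base_tokenize(surface)) >= min_pieces:
            selected.append(surface)
            if limit is not None and len(selected) >= limit:
                break
    return selected
```

With `transformers`, the resulting list could then be passed to the base tokenizer's `add_tokens` method, which is one plausible way to carry out step 3.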

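The extended tokenizer is a drop-in replacement for the original Phi-4 tokenizer. To use the new tokens effectively, the model needs continued pretraining (CPT) on Nepali text, because the added tokens have no trained embeddings in the base model. See the [Qwen3-4B Nepali model](https://huggingface.co/sidskarki/qwen3-4b-nepali) for a full CPT+SFT (continued pretraining plus supervised fine-tuning) example.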
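
Before continued pretraining, the model's embedding matrix has to cover every ID the extended tokenizer can emit. The sketch below shows one way to check and grow it using the `transformers` methods `get_input_embeddings` and `resize_token_embeddings`. The helper name `ensure_embedding_rows` is hypothetical, and whether a resize is actually needed depends on how many spare rows the base model's embedding matrix already has.

```python
def ensure_embedding_rows(model, tokenizer) -> int:
    """Grow the model's embedding matrix if the tokenizer has more IDs than rows.

    Returns the number of embedding rows after the call.
    """
    current_rows = model.get_input_embeddings().num_embeddings
    needed_rows = len(tokenizer)  # includes added tokens
    if needed_rows > current_rows:
        # Newly created rows are untrained, hence the need for continued pretraining.
        model.resize_token_embeddings(needed_rows)
        return needed_rows
    return current_rows
```

Here `model` would typically come from `AutoModelForCausalLM.from_pretrained(...)` and `tokenizer` from the `AutoTokenizer` call shown under Usage.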

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sidskarki/phi4-nepali-tokenizer")
tokens = tokenizer.tokenize("नेपालको राजधानी काठमाडौं हो")
print(tokens, len(tokens))
```

## Context

This tokenizer is part of a [17-model Nepali tokenizer benchmark](https://siddhantskarki.com/case-studies/nepali-tokenizer) that measures the Nepali token tax, the extra tokens per word that Nepali text incurs, across modern LLM tokenizers.

## Links

- **Code:** [github.com/sidskarkii/nepali-tokenizer](https://github.com/sidskarkii/nepali-tokenizer)
- **Case study:** [siddhantskarki.com/case-studies/nepali-tokenizer](https://siddhantskarki.com/case-studies/nepali-tokenizer)