	---
	language:
	- ne
	- en
	tags:
	- nepali
	- devanagari
	- tokenizer
	- tokenizer-extension
	license: other
	license_name: community-use-1.0
	license_link: LICENSE.md
	---

	# Phi-4 — Nepali Extended Tokenizer

	An extended tokenizer for Phi-4 that adds ~15K Nepali/Devanagari tokens, chosen because the base tokenizer splits them into many subtokens.

	## Token Efficiency

	| Tokenizer | Nepali tokens/word |
	|---|---:|
	| Original Phi-4 | 7.10 |
	| Extended (this) | 3.41 |
	| Reduction | 51.9% |
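
The metric in the table can be approximated with a few lines. This is a minimal sketch, not the benchmark's actual script: it assumes words are whitespace-delimited and that the figure is total tokens divided by total words over a sample. `tokens_per_word`, `reduction_pct`, and the two toy tokenizers are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed metric)."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words in sample")
    return n_tokens / n_words


def reduction_pct(before: float, after: float) -> float:
    """Relative drop in tokens per word, as a percentage."""
    return (before - after) / before * 100.0


# Toy stand-ins so the sketch runs without downloading anything:
# one token per non-space character vs. one token per word.
def char_level(text: str) -> List[str]:
    return [c for c in text if not c.isspace()]


def word_level(text: str) -> List[str]:
    return text.split()


sample = ["नेपालको राजधानी काठमाडौं हो"]
before = tokens_per_word(char_level, sample)
after = tokens_per_word(word_level, sample)
print(f"{before:.2f} -> {after:.2f} tokens/word, {reduction_pct(before, after):.1f}% fewer")
```

With real tokenizers, pass each tokenizer's `tokenize` method in place of the toy functions. Applying `reduction_pct` to the rounded table values gives roughly 52.0%, so the 51.9% above was presumably computed from unrounded measurements.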

	## How It Was Built

	1. Trained a 32K-vocabulary SentencePiece BPE tokenizer on a 7.49 GB cleaned Nepali corpus
	2. Selected the tokens from that vocabulary that the Phi-4 base tokenizer splits into 3+ subtokens (delta vocabulary approach)
	3. Added ~15K of these high-value Nepali tokens to the base tokenizer
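
The selection in step 2 can be sketched as a simple filter. This is an illustration under stated assumptions, not the project's code: `select_delta_tokens` and `toy_base_tokenize` are hypothetical names, the toy tokenizer stands in for the Phi-4 base tokenizer, and how the candidates were ranked down to ~15K is not described here, so it is left out.

```python
from typing import Callable, Iterable, List


def select_delta_tokens(
    candidates: Iterable[str],
    base_tokenize: Callable[[str], List[str]],
    min_pieces: int = 3,
) -> List[str]:
    """Keep candidates that the base tokenizer splits into >= min_pieces subtokens."""
    selected = []
    seen = set()
    for token in candidates:
        if token in seen:  # skip duplicates, keep first-seen order
            continue
        seen.add(token)
        if len(base_tokenize(token)) >= min_pieces:
            selected.append(token)
    return selected


def toy_base_tokenize(text: str) -> List[str]:
    """Stand-in for the base tokenizer: splits text into 2-character chunks."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


candidates = ["ab", "abcd", "abcde", "abcdefg", "abcde"]
delta = select_delta_tokens(candidates, toy_base_tokenize)
print(delta)  # ['abcde', 'abcdefg']
```

In a real run, `base_tokenize` would be the base tokenizer's `tokenize` method and `candidates` the pieces of the newly trained SentencePiece vocabulary (with the `▁` word-boundary marker handled first). The selected list could then be passed to the `add_tokens` method of a `transformers` tokenizer.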

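	The extended tokenizer loads the same way as the original and can replace it in an existing pipeline. The added tokens have no trained embeddings in the base model, so the model needs continued pretraining on Nepali text before it can use them effectively (see the [Qwen3-4B Nepali model](https://huggingface.co/sidskarki/qwen3-4b-nepali) for a full example of continued pretraining (CPT) followed by supervised fine-tuning (SFT)).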
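
Before any continued pretraining, the model's embedding matrix has to cover the new token ids. The helper below is a hedged sketch of that preparation step, assuming `model` and `tokenizer` are `transformers` objects; `resize_for_tokenizer` is a hypothetical name, and this card does not say how the linked model handled the step.

```python
def resize_for_tokenizer(model, tokenizer) -> int:
    """Grow the model's embedding matrix if the tokenizer has more ids than it covers.

    Returns the number of embedding rows added (0 if none were needed).
    """
    vocab_size = len(tokenizer)  # includes added tokens
    current_rows = model.get_input_embeddings().num_embeddings
    if vocab_size <= current_rows:
        return 0
    model.resize_token_embeddings(vocab_size)
    return vocab_size - current_rows
```

Here `model` would be the base Phi-4 checkpoint loaded with `AutoModelForCausalLM.from_pretrained` and `tokenizer` the extended tokenizer from this repository. The rows added by the resize are untrained, which is why continued pretraining is needed afterwards.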

	## Usage

	```python
	from transformers import AutoTokenizer

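	# Load the extended tokenizer from the Hugging Face Hub
	tokenizer = AutoTokenizer.from_pretrained("sidskarki/phi4-nepali-tokenizer")

	# Tokenize a Nepali sentence ("Nepal's capital is Kathmandu")
	tokens = tokenizer.tokenize("नेपालको राजधानी काठमाडौं हो")
	print(tokens, len(tokens))  # subtokens and their count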
	```

	## Context

	Part of a [17-model Nepali tokenizer benchmark](https://siddhantskarki.com/case-studies/nepali-tokenizer) measuring the Nepali token tax across modern LLM tokenizers.

	## Links

	- Code: [github.com/sidskarkii/nepali-tokenizer](https://github.com/sidskarkii/nepali-tokenizer)
	- Case study: [siddhantskarki.com/case-studies/nepali-tokenizer](https://siddhantskarki.com/case-studies/nepali-tokenizer)