---
title: Panini Tokenizer
emoji: 🔀
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.0
app_file: app.py
language: sa
license: apache-2.0
tags:
- sanskrit
- tokenizer
- nlp
- morphology
- transformers
- linguistics
---
# Panini Tokenizer
**The first grammar-first Sanskrit tokenizer based on Pāṇinian morphological analysis.**
[![Demo](https://img.shields.io/badge/🚀_Try_Demo-HuggingFace_Spaces-blueviolet?style=for-the-badge)](https://huggingface.co/spaces/ArthaLabs/panini-tokenizer-demo)
> **Why it matters:** *Fewer tokens = more usable context per input = better learning & longer text coverage.*
## 🚨 The Problem
Statistical tokenizers (BPE/WordPiece) systematically underperform on Sanskrit because they do not model **Sandhi** (phonetic fusion).
* **Standard Models (BERT/Qwen):** fracture complex words into phonetic noise (`##k`, `##z`, `##ab`).
* **Panini Tokenizer:** uses recursive morphological parsing to recover the original **semantic roots** (`nirapekza` + `jYAna`).
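You can see the failure mode directly. The snippet below is our own illustration (it assumes `transformers` is installed and downloads the public `google/muril-base-cased` checkpoint, one of the baselines in the benchmark table further down):

```python
from transformers import AutoTokenizer

# Illustration: a standard multilingual WordPiece tokenizer (MuRIL) on a
# sandhi-fused compound. Expect short phonetic fragments, not morphemes.
muril = AutoTokenizer.from_pretrained("google/muril-base-cased")

text = "nirapekzajYAnasAkzAtkArasAmarthyam"  # SLP1 transliteration
print(muril.tokenize(text))
# Many '##'-prefixed fragments rather than roots like 'nirapekza' + 'jYAna'
```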
## ⚡ Key Features
* 🔀 **Vocab:** 128k dictionary-backed tokens (Monier-Williams).
* 🔄 **Sandhi Reversal:** Automatically splits fused compounds by undoing phonetic fusion (e.g., `t` → `d`, `i` → `y`; toy illustration after this list).
* 🧩 **Semantic Atomicity:** Preserves complex philosophical concepts as single tokens, aligning token boundaries with linguistic meaning and reducing gradient noise during training.
* 📉 **Efficiency:** Reduces token count by **2-4x** compared to multilingual models.
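As a toy illustration of what sandhi reversal involves (our simplification, not the tokenizer's actual rule table), a splitter has to consider that a surface character may hide a different underlying one:

```python
# Toy sketch of reverse-sandhi candidates (NOT the actual rule set).
# A surface character at a candidate split point may conceal a different
# underlying character that the dictionary lookup needs to see.
REVERSE_SANDHI = {
    "d": ["t"],  # voicing: underlying 't' surfaces as 'd' before voiced sounds
    "y": ["i"],  # yan sandhi: underlying 'i' surfaces as 'y' before vowels
    "v": ["u"],  # yan sandhi: underlying 'u' surfaces as 'v' before vowels
}

def candidates(ch: str) -> list[str]:
    """The surface character plus any underlying forms it may conceal."""
    return [ch] + REVERSE_SANDHI.get(ch, [])

print(candidates("y"))  # ['y', 'i'] -> try both readings when splitting
```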
## 🚀 Quick Start
No custom installation required. Use directly with Hugging Face `transformers`:
**Note:** The model expects **SLP1 transliteration** (e.g., `vidyA`), not Devanagari.
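If your source text is in Devanagari, one option for converting it (our suggestion; the package is not a dependency of this tokenizer) is `indic_transliteration`:

```python
# pip install indic_transliteration  (assumed helper, not bundled here)
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

devanagari = "विद्या"
slp1 = transliterate(devanagari, sanscript.DEVANAGARI, sanscript.SLP1)
print(slp1)  # vidyA
```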
```python
from transformers import AutoTokenizer
# The repo ships custom tokenization logic, so trust_remote_code=True is required
tokenizer = AutoTokenizer.from_pretrained(
    "ArthaLabs/panini-tokenizer",
    trust_remote_code=True,
)
# Tokenize complex Sandhi compounds (SLP1 input)
text = "nirapekzajYAnasAkzAtkArasAmarthyam"
tokens = tokenizer.tokenize(text)
print(tokens)
```
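Continuing from the snippet above, the usual `transformers` calls should also be available (a sketch, assuming the custom class implements the standard `PreTrainedTokenizer` interface):

```python
# Sketch: standard encode/decode calls, assuming the custom tokenizer class
# implements the full PreTrainedTokenizer interface.
ids = tokenizer.encode(text)  # token ids for model input
print(len(ids), ids[:6])

roundtrip = tokenizer.decode(ids, skip_special_tokens=True)
print(roundtrip)  # should approximate the SLP1 input, modulo sandhi normalization
```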
## 📊 Benchmarks: The "Context Dividend"
By strictly adhering to grammar, Panini Tokenizer drastically reduces sequence length, effectively **tripling the usable context window** for downstream tasks. Token counts per compound (lower is better):

| Input Compound (SLP1) | **Panini (Ours)** | Google MuRIL | Qwen2 |
| --- | --- | --- | --- |
| `nirapekzajYAnasAkzAtkArasAmarthyam` | **6** | 18 | 25 |
| `tadekaniScitArthavyavasthApanam` | **6** | 13 | 18 |
| `svaprakASatvaparaprakASavyavacCedaH` | **7** | 15 | 22 |
| `svAtantryAbhAvasamucchinnakartRtvanirAsaH` | **8** | 19 | 25 |
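To reproduce these counts (a sketch; the exact baseline checkpoints are our assumption), compare `len(tokenize(...))` across tokenizers:

```python
from transformers import AutoTokenizer

# Reproduction sketch; baseline checkpoint names are our assumption.
tokenizers = {
    "Panini (Ours)": AutoTokenizer.from_pretrained(
        "ArthaLabs/panini-tokenizer", trust_remote_code=True
    ),
    "Google MuRIL": AutoTokenizer.from_pretrained("google/muril-base-cased"),
    "Qwen2": AutoTokenizer.from_pretrained("Qwen/Qwen2-7B"),
}

compound = "nirapekzajYAnasAkzAtkArasAmarthyam"
for name, tok in tokenizers.items():
    print(f"{name}: {len(tok.tokenize(compound))} tokens")
```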
### Visual Comparison
**Input:** *Independent-knowledge-direct-realization-capacity*
* **Panini:** `▁nirapekza` | `jYAna` | `sAkzAtkAra` | `sAman` | `arthy` | `am` (6 meaningful roots)
* **Sanskrit-BERT:** `nirape` | `##k` | `##z` | `##a` | `##jya` | `##nas`... (14 noise fragments)
## 📋 Use Cases
- πŸ” **Sanskrit semantic search**
- πŸ“– **QA over philosophical texts** (Vedanta, Nyaya, etc.)
- πŸ“œ **Long-form verse processing** (epics, puranas)
- πŸ€– **Training Sanskrit LLMs** with cleaner token streams
- πŸ”¬ **Linguistics research** & morphological analysis
## 🛠️ Technical Details
* **Architecture:** Recursive Descent Splitter + Kosha (Dictionary) Lookup.
* **Vocab Size:** 128,000.
* **Fallback:** Deterministic character-level splitting, used only when no grammatical parse is found (sketched below).
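A minimal sketch of that pipeline (a toy three-entry lexicon with no sandhi handling; the real kosha holds 128k Monier-Williams-backed entries):

```python
# Toy recursive-descent splitter with dictionary (kosha) lookup and a
# character-level fallback. Illustrative only.
KOSHA = {"nirapekza", "jYAna", "sAkzAtkAra"}

def split(word: str) -> list[str] | None:
    """Recursively try to cover `word` with dictionary entries."""
    if not word:
        return []
    for end in range(len(word), 0, -1):  # prefer the longest matching prefix
        prefix = word[:end]
        if prefix in KOSHA:
            rest = split(word[end:])
            if rest is not None:
                return [prefix] + rest
    return None  # no grammatical parse

def tokenize(word: str) -> list[str]:
    parse = split(word)
    return parse if parse is not None else list(word)  # character fallback

print(tokenize("nirapekzajYAnasAkzAtkAra"))  # ['nirapekza', 'jYAna', 'sAkzAtkAra']
print(tokenize("xyz"))                       # ['x', 'y', 'z'] via fallback
```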
## 📜 Citation
```bibtex
@misc{panini2025,
  author = {ArthaLabs},
  title = {Panini Tokenizer: Grammar-First Sanskrit Tokenization},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ArthaLabs/panini-tokenizer}}
}
```
## License
Apache 2.0