---
title: Panini Tokenizer
emoji: 🤖
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.0
app_file: app.py
language: sa
license: apache-2.0
tags:
- sanskrit
- tokenizer
- nlp
- morphology
- transformers
- linguistics
---
# Panini Tokenizer
**The first grammar-first Sanskrit tokenizer based on Pāṇinian morphological analysis.**
[**Live Demo on Hugging Face Spaces**](https://huggingface.co/spaces/ArthaLabs/panini-tokenizer-demo)
> **Why it matters:** *Fewer tokens = more usable context per input = better learning & longer text coverage.*
## 🚨 The Problem
Statistical tokenizers (BPE/WordPiece) systematically underperform on Sanskrit because they do not model **Sandhi** (phonetic fusion).
* **Standard Models (BERT/Qwen):** fracture complex words into phonetic noise (`##k`, `##z`, `##ab`).
* **Panini Tokenizer:** uses recursive morphological parsing to recover the original **semantic roots** (`nirapekza` + `jYAna`).
## ⚡ Key Features
* 🔤 **Vocab:** 128k dictionary-backed tokens (Monier-Williams).
* 🔄 **Sandhi Reversal:** Automatically splits fused compounds (e.g., `t` → `d`, `i` → `y`).
* 🧩 **Semantic Atomicism:** Preserves complex philosophical concepts as single tokens. This aligns token boundaries with linguistic meaning, reducing gradient noise during training.
* 📉 **Efficiency:** Reduces token count by **2-4x** compared to multilingual models.
## 🚀 Quick Start
No custom installation required. Use directly with Hugging Face `transformers`:
**Note:** The model expects **SLP1 transliteration** (e.g., `vidyA`), not Devanagari.
```python
from transformers import AutoTokenizer

# trust_remote_code=True is required: the tokenizer ships custom splitting logic
tokenizer = AutoTokenizer.from_pretrained(
    "ArthaLabs/panini-tokenizer",
    trust_remote_code=True,
)

# Tokenize a complex Sandhi compound (SLP1 input)
text = "nirapekzajYAnasAkzAtkArasAmarthyam"
tokens = tokenizer.tokenize(text)
print(tokens)
```
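Since the model expects SLP1 rather than Devanagari (see the note above), Devanagari input should be transliterated first. A minimal sketch using the `indic_transliteration` package; the package choice is an assumption, and any Devanagari-to-SLP1 converter will do:

```python
# pip install indic_transliteration
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

# Convert Devanagari to SLP1 before tokenizing
devanagari_text = "विद्या"
slp1_text = transliterate(devanagari_text, sanscript.DEVANAGARI, sanscript.SLP1)
print(slp1_text)  # vidyA
```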
## 📊 Benchmarks: The "Context Dividend"
By strictly adhering to grammar, the Panini Tokenizer drastically reduces sequence length, effectively **tripling the usable context window** for downstream tasks. Token counts per compound (lower is better):
| Input Compound (SLP1) | **Panini (Ours)** | Google MuRIL | Qwen2 |
| --- | --- | --- | --- |
| `nirapekzajYAnasAkzAtkArasAmarthyam` | **6** | 18 | 25 |
| `tadekaniScitArthavyavasthApanam` | **6** | 13 | 18 |
| `svaprakASatvaparaprakASavyavacCedaH` | **7** | 15 | 22 |
| `svAtantryAbhAvasamucchinnakartRtvanirAsaH` | **8** | 19 | 25 |
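These counts can be checked with a short script. A sketch, with the caveat that the baseline checkpoint IDs (`google/muril-base-cased`, `Qwen/Qwen2-7B`) are assumptions about which public tokenizers the columns correspond to:

```python
from transformers import AutoTokenizer

compounds = [
    "nirapekzajYAnasAkzAtkArasAmarthyam",
    "tadekaniScitArthavyavasthApanam",
    "svaprakASatvaparaprakASavyavacCedaH",
    "svAtantryAbhAvasamucchinnakartRtvanirAsaH",
]

# Baseline checkpoints are assumptions; swap in any tokenizer you want to compare.
tokenizers = {
    "Panini": AutoTokenizer.from_pretrained("ArthaLabs/panini-tokenizer", trust_remote_code=True),
    "MuRIL": AutoTokenizer.from_pretrained("google/muril-base-cased"),
    "Qwen2": AutoTokenizer.from_pretrained("Qwen/Qwen2-7B"),
}

for text in compounds:
    print(text)
    for name, tok in tokenizers.items():
        print(f"  {name:>6}: {len(tok.tokenize(text))} tokens")
```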
### Visual Comparison
**Input:** *Independent-knowledge-direct-realization-capacity*
* **Panini:** `▁nirapekza` | `jYAna` | `sAkzAtkAra` | `sAman` | `arthy` | `am` (6 meaningful roots)
* **Sanskrit-BERT:** `nirape` | `##k` | `##z` | `##a` | `##jya` | `##nas`... (14 noise fragments)
## 📌 Use Cases
- 🔍 **Sanskrit semantic search**
- 📚 **QA over philosophical texts** (Vedanta, Nyaya, etc.)
- 📜 **Long-form verse processing** (epics, puranas)
- 🤖 **Training Sanskrit LLMs** with cleaner token streams
- 🔬 **Linguistics research** & morphological analysis
## 🛠️ Technical Details
* **Architecture:** Recursive Descent Splitter + Kosha (Dictionary) Lookup.
* **Vocab Size:** 128,000.
* **Fallback:** Deterministic character-level splitting, used only when grammatical analysis fails (see the sketch below).
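For intuition only, a minimal sketch of the longest-match recursive splitting strategy with the character-level fallback described above. The dictionary contents and function name are illustrative assumptions, and the production tokenizer additionally reverses sandhi at candidate boundaries, which this sketch omits:

```python
def split_compound(text: str, kosha: set[str], max_len: int = 20) -> list[str]:
    """Recursively split `text` into kosha (dictionary) entries, longest match first."""
    if not text:
        return []
    # Prefer the longest dictionary-backed prefix at the current position
    for end in range(min(len(text), max_len), 0, -1):
        if text[:end] in kosha:
            return [text[:end]] + split_compound(text[end:], kosha, max_len)
    # Deterministic fallback: emit one character and keep going
    return [text[0]] + split_compound(text[1:], kosha, max_len)

kosha = {"nirapekza", "jYAna", "sAkzAtkAra"}
print(split_compound("nirapekzajYAnasAkzAtkAra", kosha))
# ['nirapekza', 'jYAna', 'sAkzAtkAra']
```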
## 📖 Citation
```bibtex
@misc{panini2025,
  author       = {ArthaLabs},
  title        = {Panini Tokenizer: Grammar-First Sanskrit Tokenization},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ArthaLabs/panini-tokenizer}}
}
```
## License
Apache 2.0