---
title: Panini Tokenizer
emoji: 🔀
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.0
app_file: app.py
language: sa
license: apache-2.0
tags:
- sanskrit
- tokenizer
- nlp
- morphology
- transformers
- linguistics
---
# Panini Tokenizer
**The first grammar-first Sanskrit tokenizer based on Pāṇinian morphological analysis.**
[![Demo](https://img.shields.io/badge/🚀_Try_Demo-HuggingFace_Spaces-blueviolet?style=for-the-badge)](https://huggingface.co/spaces/ArthaLabs/panini-tokenizer-demo)
> **Why it matters:** *Fewer tokens = more usable context per input = better learning & longer text coverage.*
## 🚨 The Problem
Statistical tokenizers (BPE/WordPiece) systematically underperform on Sanskrit because they do not model **Sandhi** (phonetic fusion).
* **Standard Models (BERT/Qwen):** fracture complex words into phonetic noise (`##k`, `##z`, `##ab`).
* **Panini Tokenizer:** uses recursive morphological parsing to recover the original **semantic roots** (`nirapekza` + `jYAna`).
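You can see the failure mode directly. The snippet below is our own illustration (it assumes `transformers` is installed and downloads the public `google/muril-base-cased` checkpoint, one of the baselines in the benchmark table further down):

```python
from transformers import AutoTokenizer

# Illustration: a standard multilingual WordPiece tokenizer (MuRIL) on a
# sandhi-fused compound. Expect short phonetic fragments, not morphemes.
muril = AutoTokenizer.from_pretrained("google/muril-base-cased")

text = "nirapekzajYAnasAkzAtkArasAmarthyam"  # SLP1 transliteration
print(muril.tokenize(text))
# Many '##'-prefixed fragments rather than roots like 'nirapekza' + 'jYAna'
```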
## ⚡ Key Features
* 🔀 **Vocab:** 128k dictionary-backed tokens (Monier-Williams).
* 🔄 **Sandhi Reversal:** Automatically splits fused compounds by undoing phonetic fusion (e.g., `t` → `d`, `i` → `y`; toy illustration after this list).
* 🧩 **Semantic Atomicity:** Preserves complex philosophical concepts as single tokens, aligning token boundaries with linguistic meaning and reducing gradient noise during training.
* 📉 **Efficiency:** Reduces token count by **2-4x** compared to multilingual models.
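As a toy illustration of what sandhi reversal involves (our simplification, not the tokenizer's actual rule table), a splitter has to consider that a surface character may hide a different underlying one:

```python
# Toy sketch of reverse-sandhi candidates (NOT the actual rule set).
# A surface character at a candidate split point may conceal a different
# underlying character that the dictionary lookup needs to see.
REVERSE_SANDHI = {
    "d": ["t"],  # voicing: underlying 't' surfaces as 'd' before voiced sounds
    "y": ["i"],  # yan sandhi: underlying 'i' surfaces as 'y' before vowels
    "v": ["u"],  # yan sandhi: underlying 'u' surfaces as 'v' before vowels
}

def candidates(ch: str) -> list[str]:
    """The surface character plus any underlying forms it may conceal."""
    return [ch] + REVERSE_SANDHI.get(ch, [])

print(candidates("y"))  # ['y', 'i'] -> try both readings when splitting
```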
## 🚀 Quick Start
No custom installation required. Use directly with Hugging Face `transformers`:
**Note:** The model expects **SLP1 transliteration** (e.g., `vidyA`), not Devanagari.
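If your source text is in Devanagari, one option for converting it (our suggestion; the package is not a dependency of this tokenizer) is `indic_transliteration`:

```python
# pip install indic_transliteration  (assumed helper, not bundled here)
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

devanagari = "विद्या"
slp1 = transliterate(devanagari, sanscript.DEVANAGARI, sanscript.SLP1)
print(slp1)  # vidyA
```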
```python
from transformers import AutoTokenizer
# The repo ships custom tokenization logic, so trust_remote_code=True is required
tokenizer = AutoTokenizer.from_pretrained(
    "ArthaLabs/panini-tokenizer",
    trust_remote_code=True,
)
# Tokenize complex Sandhi compounds (SLP1 input)
text = "nirapekzajYAnasAkzAtkArasAmarthyam"
tokens = tokenizer.tokenize(text)
print(tokens)
```
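Continuing from the snippet above, the usual `transformers` calls should also be available (a sketch, assuming the custom class implements the standard `PreTrainedTokenizer` interface):

```python
# Sketch: standard encode/decode calls, assuming the custom tokenizer class
# implements the full PreTrainedTokenizer interface.
ids = tokenizer.encode(text)  # token ids for model input
print(len(ids), ids[:6])

roundtrip = tokenizer.decode(ids, skip_special_tokens=True)
print(roundtrip)  # should approximate the SLP1 input, modulo sandhi normalization
```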
## 📊 Benchmarks: The "Context Dividend"
By strictly adhering to grammar, Panini Tokenizer drastically reduces sequence length, effectively **tripling the usable context window** for downstream tasks. Token counts per compound (lower is better):

| Input Compound (SLP1) | **Panini (Ours)** | Google MuRIL | Qwen2 |
| --- | --- | --- | --- |
| `nirapekzajYAnasAkzAtkArasAmarthyam` | **6** | 18 | 25 |
| `tadekaniScitArthavyavasthApanam` | **6** | 13 | 18 |
| `svaprakASatvaparaprakASavyavacCedaH` | **7** | 15 | 22 |
| `svAtantryAbhAvasamucchinnakartRtvanirAsaH` | **8** | 19 | 25 |
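To reproduce these counts (a sketch; the exact baseline checkpoints are our assumption), compare `len(tokenize(...))` across tokenizers:

```python
from transformers import AutoTokenizer

# Reproduction sketch; baseline checkpoint names are our assumption.
tokenizers = {
    "Panini (Ours)": AutoTokenizer.from_pretrained(
        "ArthaLabs/panini-tokenizer", trust_remote_code=True
    ),
    "Google MuRIL": AutoTokenizer.from_pretrained("google/muril-base-cased"),
    "Qwen2": AutoTokenizer.from_pretrained("Qwen/Qwen2-7B"),
}

compound = "nirapekzajYAnasAkzAtkArasAmarthyam"
for name, tok in tokenizers.items():
    print(f"{name}: {len(tok.tokenize(compound))} tokens")
```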
### Visual Comparison
**Input:** *Independent-knowledge-direct-realization-capacity*
* **Panini:** `▁nirapekza` | `jYAna` | `sAkzAtkAra` | `sAman` | `arthy` | `am` (6 meaningful roots)
* **Sanskrit-BERT:** `nirape` | `##k` | `##z` | `##a` | `##jya` | `##nas`... (14 noise fragments)
## 📋 Use Cases
- πŸ” **Sanskrit semantic search**
- πŸ“– **QA over philosophical texts** (Vedanta, Nyaya, etc.)
- πŸ“œ **Long-form verse processing** (epics, puranas)
- πŸ€– **Training Sanskrit LLMs** with cleaner token streams
- πŸ”¬ **Linguistics research** & morphological analysis
## 🛠️ Technical Details
* **Architecture:** Recursive Descent Splitter + Kosha (Dictionary) Lookup.
* **Vocab Size:** 128,000.
* **Fallback:** Deterministic character-level splitting, used only when no grammatical parse is found (sketched below).
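A minimal sketch of that pipeline (a toy three-entry lexicon with no sandhi handling; the real kosha holds 128k Monier-Williams-backed entries):

```python
# Toy recursive-descent splitter with dictionary (kosha) lookup and a
# character-level fallback. Illustrative only.
KOSHA = {"nirapekza", "jYAna", "sAkzAtkAra"}

def split(word: str) -> list[str] | None:
    """Recursively try to cover `word` with dictionary entries."""
    if not word:
        return []
    for end in range(len(word), 0, -1):  # prefer the longest matching prefix
        prefix = word[:end]
        if prefix in KOSHA:
            rest = split(word[end:])
            if rest is not None:
                return [prefix] + rest
    return None  # no grammatical parse

def tokenize(word: str) -> list[str]:
    parse = split(word)
    return parse if parse is not None else list(word)  # character fallback

print(tokenize("nirapekzajYAnasAkzAtkAra"))  # ['nirapekza', 'jYAna', 'sAkzAtkAra']
print(tokenize("xyz"))                       # ['x', 'y', 'z'] via fallback
```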
## 📜 Citation
```bibtex
@misc{panini2025,
  author = {ArthaLabs},
  title = {Panini Tokenizer: Grammar-First Sanskrit Tokenization},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ArthaLabs/panini-tokenizer}}
}
```
## License
Apache 2.0