---
title: Panini Tokenizer
emoji: 🤖
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.0
app_file: app.py
language: sa
license: apache-2.0
tags:
- sanskrit
- tokenizer
- nlp
- morphology
- transformers
- linguistics
---
# Panini Tokenizer

**The first grammar-first Sanskrit tokenizer based on Pāṇinian morphological analysis.**

[Live Demo](https://huggingface.co/spaces/ArthaLabs/panini-tokenizer-demo)

> **Why it matters:** *fewer tokens means more usable context per input, which means better learning and longer text coverage.*
## The Problem

Statistical tokenizers (BPE/WordPiece) systematically underperform on Sanskrit because they do not model **Sandhi** (phonetic fusion).

* **Standard models (BERT/Qwen):** fracture complex words into phonetic noise (`##k`, `##z`, `##ab`).
* **Panini Tokenizer:** uses recursive morphological parsing to recover the original **semantic roots** (`nirapekza` + `jYAna`).
## β‘ Key Features
|
|
|
|
|
|
* π€ **Vocab:** 128k dictionary-backed tokens (Monier-Williams).
|
|
|
* π **Sandhi Reversal:** Automatically splits fused compounds (e.g., `t` β `d`, `i` β `y`).
|
|
|
* π§© **Semantic Atomicism:** Preserves complex philosophical concepts as single tokens. This aligns token boundaries with linguistic meaning, reducing gradient noise during training.
|
|
|
* π **Efficiency:** Reduces token count by **2-4x** compared to multilingual models.
|
|
|
|
|
|
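Sandhi reversal can be illustrated with a small sketch. The rule table and function below are purely illustrative assumptions for exposition, not the tokenizer's actual grammar or API: they show how a splitter can propose candidate word boundaries by undoing common junction substitutions such as `t → d`.

```python
# Illustrative reverse-sandhi table: at a proposed split point, the
# character on the left edge may have been altered by fusion.
REVERSE_SANDHI = {
    "d": ["t", "d"],  # word-final t often voices to d before voiced sounds
    "y": ["i", "y"],  # word-final i often glides to y before vowels
}

def split_candidates(fused: str, i: int) -> list[tuple[str, str]]:
    """Propose (left, right) splits at position i, undoing sandhi on the left edge."""
    left, right = fused[:i], fused[i:]
    options = REVERSE_SANDHI.get(left[-1], [left[-1]])
    return [(left[:-1] + restored, right) for restored in options]

# "tat" + "eva" fuses to "tadeva"; splitting after position 3
# proposes both the restored and the literal left-hand word.
print(split_candidates("tadeva", 3))  # [('tat', 'eva'), ('tad', 'eva')]
```

A real splitter would then check each candidate pair against the dictionary and keep only grammatically valid splits.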
## π Quick Start
|
|
|
|
|
|
No custom installation required. Use directly with Hugging Face `transformers`:
|
|
|
**Note:** The model expects **SLP1 transliteration** (e.g., `vidyA`), not Devanagari.
|
|
|
```python
|
|
|
from transformers import AutoTokenizer
|
|
|
|
|
|
# Load with trust_remote_code=True because of custom logic
|
|
|
tokenizer = AutoTokenizer.from_pretrained(
|
|
|
"ArthaLabs/panini-tokenizer",
|
|
|
trust_remote_code=True
|
|
|
)
|
|
|
|
|
|
# Tokenize complex Sandhi compounds (SLP1 input)
|
|
|
text = "nirapekzajYAnasAkzAtkArasAmarthyam"
|
|
|
tokens = tokenizer.tokenize(text)
|
|
|
|
|
|
print(tokens)
|
|
|
```
|
|
|
## Benchmarks: The "Context Dividend"

By strictly adhering to grammar, Panini Tokenizer drastically reduces sequence length, effectively **tripling the context window** for downstream tasks.

| Input Compound | **Panini (Ours)** | Google MuRIL | Qwen2 |
| --- | --- | --- | --- |
| `nirapekzajYAnasAkzAtkArasAmarthyam` | **6** | 18 | 25 |
| `tadekaniScitArthavyavasthApanam` | **6** | 13 | 18 |
| `svaprakASatvaparaprakASavyavacCedaH` | **7** | 15 | 22 |
| `svAtantryAbhAvasamucchinnakartRtvanirAsaH` | **8** | 19 | 25 |
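The "tripling" claim can be sanity-checked directly from the counts in the table above:

```python
# Token counts copied from the benchmark table: (Panini, MuRIL, Qwen2)
counts = {
    "nirapekzajYAnasAkzAtkArasAmarthyam":        (6, 18, 25),
    "tadekaniScitArthavyavasthApanam":           (6, 13, 18),
    "svaprakASatvaparaprakASavyavacCedaH":       (7, 15, 22),
    "svAtantryAbhAvasamucchinnakartRtvanirAsaH": (8, 19, 25),
}

# Average compression ratio of each baseline relative to Panini
avg_muril = sum(m / p for p, m, q in counts.values()) / len(counts)
avg_qwen2 = sum(q / p for p, m, q in counts.values()) / len(counts)
print(f"MuRIL needs {avg_muril:.1f}x more tokens, Qwen2 {avg_qwen2:.1f}x more")
# MuRIL needs 2.4x more tokens, Qwen2 3.4x more
```

On these four compounds, Qwen2 averages about 3.4x more tokens than Panini, which is where the roughly tripled effective context comes from.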
### Visual Comparison

**Input:** *Independent-knowledge-direct-realization-capacity*

* **Panini:** `▁nirapekza` | `jYAna` | `sAkzAtkAra` | `sAman` | `arthy` | `am` (6 meaningful roots)
* **Sanskrit-BERT:** `nirape` | `##k` | `##z` | `##a` | `##jya` | `##nas`... (14 noise fragments)
## π Use Cases
|
|
|
|
|
|
- π **Sanskrit semantic search**
|
|
|
- π **QA over philosophical texts** (Vedanta, Nyaya, etc.)
|
|
|
- π **Long-form verse processing** (epics, puranas)
|
|
|
- π€ **Training Sanskrit LLMs** with cleaner token streams
|
|
|
- π¬ **Linguistics research** & morphological analysis
|
|
|
|
|
|
## π οΈ Technical Details
|
|
|
|
|
|
* **Architecture:** Recursive Descent Splitter + Kosha (Dictionary) Lookup.
|
|
|
* **Vocab Size:** 128,000.
|
|
|
* **Fallback:** Deterministic fallback: character-level only when grammar fails
|
|
|
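The pieces above can be sketched in miniature. This is an assumption-laden toy, not the shipped algorithm: the kosha here is a five-entry stand-in for the real 128k vocabulary, and the "fewest pieces wins" scoring is a simplification of the actual grammatical analysis. It does, however, show the recursive-descent structure and the deterministic character-level fallback.

```python
from functools import lru_cache

# Toy kosha (dictionary) standing in for the real 128k-entry vocabulary
KOSHA = {"nirapekza", "jYAna", "sAkzAtkAra", "tad", "eva"}

@lru_cache(maxsize=None)
def split(word: str) -> tuple[str, ...]:
    """Recursive-descent split: prefer the fewest dictionary-backed pieces;
    fall back to single characters only when no dictionary prefix matches."""
    if not word:
        return ()
    best = None
    # Try the longest dictionary prefix first, then recurse on the remainder
    for i in range(len(word), 0, -1):
        if word[:i] in KOSHA:
            candidate = (word[:i],) + split(word[i:])
            if best is None or len(candidate) < len(best):
                best = candidate
    if best is not None:
        return best
    # Deterministic fallback: peel one character and continue
    return (word[0],) + split(word[1:])

print(split("nirapekzajYAna"))  # ('nirapekza', 'jYAna')
```

Memoizing the recursion (`lru_cache`) keeps the search linear in practice even though many split points are explored.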
## π Citation
|
|
|
|
|
|
```bibtex
|
|
|
@misc{panini2025,
|
|
|
author = {ArthaLabs},
|
|
|
title = {Panini Tokenizer: Grammar-First Sanskrit Tokenization},
|
|
|
year = {2025},
|
|
|
publisher = {Hugging Face},
|
|
|
howpublished = {\url{https://huggingface.co/ArthaLabs/panini-tokenizer}}
|
|
|
}
|
|
|
```
|
|
|
|
|
|
## License
|
|
|
|
|
|
Apache 2.0
|
|
|
|