---
title: Panini Tokenizer
emoji: 🤖
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.0
app_file: app.py
language: sa
license: apache-2.0
tags:
- sanskrit
- tokenizer
- nlp
- morphology
- transformers
- linguistics
---
# Panini Tokenizer
**The first grammar-first Sanskrit tokenizer based on Pāṇinian morphological analysis.**
[**Live Demo on Hugging Face Spaces**](https://huggingface.co/spaces/ArthaLabs/panini-tokenizer-demo)
> **Why it matters:** *Fewer tokens = more usable context per input = better learning & longer text coverage.*
## 🚨 The Problem
Statistical tokenizers (BPE/WordPiece) systematically underperform on Sanskrit because they do not model **Sandhi** (phonetic fusion).
* **Standard Models (BERT/Qwen):** fracture complex words into phonetic noise (`##k`, `##z`, `##ab`).
* **Panini Tokenizer:** uses recursive morphological parsing to recover the original **semantic roots** (`nirapekza` + `jYAna`).
## ⚡ Key Features
* 🔤 **Vocab:** 128k dictionary-backed tokens (Monier-Williams).
* 🔄 **Sandhi Reversal:** Automatically splits fused compounds (e.g., `t` → `d`, `i` → `y`).
* 🧩 **Semantic Atomicism:** Preserves complex philosophical concepts as single tokens. This aligns token boundaries with linguistic meaning, reducing gradient noise during training.
* 📉 **Efficiency:** Reduces token count by **2-4x** compared to multilingual models.
## 🚀 Quick Start
No custom installation required. Use directly with Hugging Face `transformers`:
**Note:** The model expects **SLP1 transliteration** (e.g., `vidyA`), not Devanagari.
```python
from transformers import AutoTokenizer

# trust_remote_code=True is required: the tokenizer ships custom splitting logic
tokenizer = AutoTokenizer.from_pretrained(
    "ArthaLabs/panini-tokenizer",
    trust_remote_code=True,
)

# Tokenize a complex Sandhi compound (SLP1 input)
text = "nirapekzajYAnasAkzAtkArasAmarthyam"
tokens = tokenizer.tokenize(text)
print(tokens)
```
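Since the model expects SLP1 rather than Devanagari (see the note above), Devanagari input should be transliterated first. A minimal sketch using the `indic_transliteration` package; the package choice is an assumption, and any Devanagari-to-SLP1 converter will do:

```python
# pip install indic_transliteration
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

# Convert Devanagari to SLP1 before tokenizing
devanagari_text = "विद्या"
slp1_text = transliterate(devanagari_text, sanscript.DEVANAGARI, sanscript.SLP1)
print(slp1_text)  # vidyA
```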
## 📊 Benchmarks: The "Context Dividend"
By strictly adhering to grammar, the Panini Tokenizer drastically reduces sequence length, effectively **tripling the usable context window** for downstream tasks. Token counts per compound (lower is better):
| Input Compound (SLP1) | **Panini (Ours)** | Google MuRIL | Qwen2 |
| --- | --- | --- | --- |
| `nirapekzajYAnasAkzAtkArasAmarthyam` | **6** | 18 | 25 |
| `tadekaniScitArthavyavasthApanam` | **6** | 13 | 18 |
| `svaprakASatvaparaprakASavyavacCedaH` | **7** | 15 | 22 |
| `svAtantryAbhAvasamucchinnakartRtvanirAsaH` | **8** | 19 | 25 |
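These counts can be checked with a short script. A sketch, with the caveat that the baseline checkpoint IDs (`google/muril-base-cased`, `Qwen/Qwen2-7B`) are assumptions about which public tokenizers the columns correspond to:

```python
from transformers import AutoTokenizer

compounds = [
    "nirapekzajYAnasAkzAtkArasAmarthyam",
    "tadekaniScitArthavyavasthApanam",
    "svaprakASatvaparaprakASavyavacCedaH",
    "svAtantryAbhAvasamucchinnakartRtvanirAsaH",
]

# Baseline checkpoints are assumptions; swap in any tokenizer you want to compare.
tokenizers = {
    "Panini": AutoTokenizer.from_pretrained("ArthaLabs/panini-tokenizer", trust_remote_code=True),
    "MuRIL": AutoTokenizer.from_pretrained("google/muril-base-cased"),
    "Qwen2": AutoTokenizer.from_pretrained("Qwen/Qwen2-7B"),
}

for text in compounds:
    print(text)
    for name, tok in tokenizers.items():
        print(f"  {name:>6}: {len(tok.tokenize(text))} tokens")
```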
### Visual Comparison
**Input:** *Independent-knowledge-direct-realization-capacity*
* **Panini:** `▁nirapekza` | `jYAna` | `sAkzAtkAra` | `sAman` | `arthy` | `am` (6 meaningful roots)
* **Sanskrit-BERT:** `nirape` | `##k` | `##z` | `##a` | `##jya` | `##nas`... (14 noise fragments)
## 📌 Use Cases
- 🔍 **Sanskrit semantic search**
- 📚 **QA over philosophical texts** (Vedanta, Nyaya, etc.)
- 📜 **Long-form verse processing** (epics, puranas)
- 🤖 **Training Sanskrit LLMs** with cleaner token streams
- 🔬 **Linguistics research** & morphological analysis
## 🛠️ Technical Details
* **Architecture:** Recursive Descent Splitter + Kosha (Dictionary) Lookup.
* **Vocab Size:** 128,000.
* **Fallback:** Deterministic character-level splitting, used only when grammatical analysis fails (see the sketch below).
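For intuition only, a minimal sketch of the longest-match recursive splitting strategy with the character-level fallback described above. The dictionary contents and function name are illustrative assumptions, and the production tokenizer additionally reverses sandhi at candidate boundaries, which this sketch omits:

```python
def split_compound(text: str, kosha: set[str], max_len: int = 20) -> list[str]:
    """Recursively split `text` into kosha (dictionary) entries, longest match first."""
    if not text:
        return []
    # Prefer the longest dictionary-backed prefix at the current position
    for end in range(min(len(text), max_len), 0, -1):
        if text[:end] in kosha:
            return [text[:end]] + split_compound(text[end:], kosha, max_len)
    # Deterministic fallback: emit one character and keep going
    return [text[0]] + split_compound(text[1:], kosha, max_len)

kosha = {"nirapekza", "jYAna", "sAkzAtkAra"}
print(split_compound("nirapekzajYAnasAkzAtkAra", kosha))
# ['nirapekza', 'jYAna', 'sAkzAtkAra']
```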
## 📖 Citation
```bibtex
@misc{panini2025,
  author       = {ArthaLabs},
  title        = {Panini Tokenizer: Grammar-First Sanskrit Tokenization},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ArthaLabs/panini-tokenizer}}
}
```
## License
Apache 2.0