---

title: Panini Tokenizer
emoji: 🔀
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.0
app_file: app.py
language: sa
license: apache-2.0
tags:
- sanskrit
- tokenizer
- nlp
- morphology
- transformers
- linguistics
---

# Panini Tokenizer

**The first grammar-first Sanskrit tokenizer based on Pāṇinian morphological analysis.**

[![Demo](https://img.shields.io/badge/🚀_Try_Demo-HuggingFace_Spaces-blueviolet?style=for-the-badge)](https://huggingface.co/spaces/ArthaLabs/panini-tokenizer-demo)

> **Why it matters:** *Fewer tokens = more usable context per input = better learning & longer text coverage.*

## 🚨 The Problem

Statistical tokenizers (BPE/WordPiece) systematically underperform on Sanskrit because they do not model **Sandhi** (phonetic fusion), the sound changes that fuse adjacent words and morphemes at their boundaries.

* **Standard Models (BERT/Qwen):** fracture complex words into phonetic noise (`##k`, `##z`, `##ab`).
* **Panini Tokenizer:** uses recursive morphological parsing to recover the original **semantic roots** (`nirapekza` + `jYAna`).

## ⚡ Key Features

* 🔀 **Vocab:** 128k dictionary-backed tokens (Monier-Williams).
* 🔄 **Sandhi Reversal:** Automatically splits fused compounds by undoing fusion rules (e.g., `t` → `d`, `i` → `y`).
* 🧩 **Semantic Atomicity:** Preserves complex philosophical concepts as single tokens. This aligns token boundaries with linguistic meaning, reducing gradient noise during training.
* 📉 **Efficiency:** Reduces token count by **2-4x** compared to multilingual models.
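
The Sandhi-reversal step can be illustrated with a minimal sketch. The rule table and word list below are toy stand-ins for illustration, not the tokenizer's actual data:

```python
# Toy reverse-Sandhi splitter over SLP1-style strings.
# Sandhi turns "tat" + "asti" into "tadasti" (final t voices to d before a
# vowel); undoing such rules exposes the underlying dictionary words.

LEXICON = {"tat", "asti", "iti", "eva"}      # toy word list
REVERSALS = {"d": ["t"], "y": ["i"]}         # toy surface -> underlying forms

def desandhi_splits(text):
    """Yield (left, right) pairs where undoing one rule gives two known words."""
    for i in range(1, len(text)):
        for underlying in REVERSALS.get(text[i - 1], []):
            left = text[:i - 1] + underlying
            right = text[i:]
            if left in LEXICON and right in LEXICON:
                yield (left, right)

print(list(desandhi_splits("tadasti")))  # [('tat', 'asti')]
```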

## 🚀 Quick Start

No custom installation is required; use it directly with Hugging Face `transformers`.

**Note:** The model expects **SLP1 transliteration** (e.g., `vidyA`), not Devanagari.
```python
from transformers import AutoTokenizer

# trust_remote_code=True is needed because the splitter ships as custom code
tokenizer = AutoTokenizer.from_pretrained(
    "ArthaLabs/panini-tokenizer",
    trust_remote_code=True,
)

# Tokenize a complex Sandhi compound (SLP1 input)
text = "nirapekzajYAnasAkzAtkArasAmarthyam"
tokens = tokenizer.tokenize(text)
print(tokens)
```

## 📊 Benchmarks: The "Context Dividend"

By strictly adhering to grammar, the Panini Tokenizer sharply reduces sequence length (roughly 2.4x shorter than MuRIL and 3.4x shorter than Qwen2 on the compounds below), effectively tripling the usable context window for downstream tasks.

| Input Compound | **Panini (Ours)** | Google MuRIL | Qwen2 |
| --- | --- | --- | --- |
| `nirapekzajYAnasAkzAtkArasAmarthyam` | **6** | 18 | 25 |
| `tadekaniScitArthavyavasthApanam` | **6** | 13 | 18 |
| `svaprakASatvaparaprakASavyavacCedaH` | **7** | 15 | 22 |
| `svAtantryAbhAvasamucchinnakartRtvanirAsaH` | **8** | 19 | 25 |
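
As a sanity check on the "tripling" claim, the average per-input ratio can be computed directly from the table above:

```python
def avg_ratio(baseline_counts, panini_counts):
    """Mean token-count ratio of a baseline tokenizer vs. Panini."""
    pairs = zip(baseline_counts, panini_counts)
    return sum(b / p for b, p in pairs) / len(panini_counts)

panini = [6, 6, 7, 8]      # token counts from the table above
muril  = [18, 13, 15, 19]
qwen2  = [25, 18, 22, 25]

print(f"vs MuRIL: {avg_ratio(muril, panini):.2f}x")  # vs MuRIL: 2.42x
print(f"vs Qwen2: {avg_ratio(qwen2, panini):.2f}x")  # vs Qwen2: 3.36x
```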

### Visual Comparison

**Input:** *Independent-knowledge-direct-realization-capacity*

* **Panini:** `▁nirapekza` | `jYAna` | `sAkzAtkAra` | `sAman` | `arthy` | `am` (6 meaningful roots)
* **Sanskrit-BERT:** `nirape` | `##k` | `##z` | `##a` | `##jya` | `##nas`... (14 noise fragments)

## 📋 Use Cases

- 🔍 **Sanskrit semantic search**
- 📖 **QA over philosophical texts** (Vedanta, Nyaya, etc.)
- 📜 **Long-form verse processing** (epics, puranas)
- 🤖 **Training Sanskrit LLMs** with cleaner token streams
- 🔬 **Linguistics research** & morphological analysis

## 🛠️ Technical Details

* **Architecture:** Recursive Descent Splitter + Kosha (Dictionary) Lookup.
* **Vocab Size:** 128,000.
* **Fallback:** Deterministic character-level segmentation, used only when no grammatical parse succeeds.

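
The splitter's control flow can be sketched as below, assuming greedy longest-match against the kosha; the real implementation also applies Sandhi reversal, and the word list here is a toy stand-in for the 128k lexicon:

```python
KOSHA = {"nirapekza", "jYAna", "sAkzAtkAra"}  # toy stand-in for the lexicon

def split(text):
    """Longest-match dictionary segmentation with character-level fallback."""
    if not text:
        return []
    # Try the longest kosha entry that prefixes the remaining text.
    for length in range(len(text), 0, -1):
        if text[:length] in KOSHA:
            return [text[:length]] + split(text[length:])
    # Deterministic fallback: emit a single character and continue.
    return [text[0]] + split(text[1:])

print(split("nirapekzajYAna"))  # ['nirapekza', 'jYAna']
print(split("xyz"))             # ['x', 'y', 'z']
```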
## 📜 Citation

```bibtex
@misc{panini2025,
  author       = {ArthaLabs},
  title        = {Panini Tokenizer: Grammar-First Sanskrit Tokenization},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ArthaLabs/panini-tokenizer}}
}
```

## License

Apache 2.0