Panini Tokenizer

The first grammar-first Sanskrit tokenizer based on Pāṇinian morphological analysis.

Demo

Why it matters: Fewer tokens = more usable context per input = better learning & longer text coverage.

🚨 The Problem

Statistical tokenizers (BPE/WordPiece) systematically underperform on Sanskrit because they do not model Sandhi (phonetic fusion).

  • Standard Models (BERT/Qwen): fracture complex words into phonetic noise (##k, ##z, ##ab).
  • Panini Tokenizer: uses recursive morphological parsing to recover the original semantic roots (nirapekza + jYAna).

⚡ Key Features

  • 🔀 Vocab: 128k dictionary-backed tokens (Monier-Williams).
  • 🔄 Sandhi Reversal: Automatically splits fused compounds (e.g., t → d, i → y).
  • 🧩 Semantic Atomicism: Preserves complex philosophical concepts as single tokens. This aligns token boundaries with linguistic meaning, reducing gradient noise during training.
  • 📉 Efficiency: Reduces token count by 2-4x compared to multilingual models.
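The sandhi-reversal idea above can be sketched in a few lines: try undoing a sound change at each candidate split point and accept the split only if both halves are dictionary words. The two-rule table and tiny lexicon here are illustrative assumptions, not the tokenizer's actual rules or data:

```python
# Minimal sketch of sandhi reversal on SLP1 text. The reversal table and
# lexicon below are toy examples, not the Panini Tokenizer's real data.

SANDHI_REVERSALS = {
    "d": ["t"],   # a final t may surface as d before a voiced sound
    "y": ["i"],   # a final i may surface as y before a vowel
}

LEXICON = {"tat", "asti", "vidyA", "jYAna"}  # toy dictionary (kosha)

def split_sandhi(text):
    """Return (left, right) if text splits into two dictionary words
    after undoing one sandhi change at the boundary, else None."""
    for i in range(1, len(text)):
        left, right = text[:i], text[i:]
        # candidate restorations of the left word's final sound
        candidates = [left] + [left[:-1] + orig
                               for orig in SANDHI_REVERSALS.get(left[-1], [])]
        for cand in candidates:
            if cand in LEXICON and right in LEXICON:
                return cand, right
    return None

print(split_sandhi("tadasti"))  # ('tat', 'asti'): d at the boundary reverts to t
```

Real sandhi involves many more rules (vowel coalescence, visarga changes, internal sandhi), but the structure — propose a reversal, validate both halves against the dictionary — is the same.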

🚀 Quick Start

No custom installation is required; use it directly with Hugging Face transformers. Note: the model expects SLP1 transliteration (e.g., vidyA), not Devanagari.

from transformers import AutoTokenizer

# Load with trust_remote_code=True because of custom logic
tokenizer = AutoTokenizer.from_pretrained(
    "ArthaLabs/panini-tokenizer",
    trust_remote_code=True
)

# Tokenize complex Sandhi compounds (SLP1 input)
text = "nirapekzajYAnasAkzAtkArasAmarthyam"
tokens = tokenizer.tokenize(text)

print(tokens)

📊 Benchmarks: The "Context Dividend"

By strictly adhering to grammar, Panini Tokenizer drastically reduces sequence length, effectively tripling the context window for downstream tasks.

Input Compound                              Panini (Ours)   Google MuRIL   Qwen2
nirapekzajYAnasAkzAtkArasAmarthyam          6               18             25
tadekaniScitArthavyavasthApanam             6               13             18
svaprakASatvaparaprakASavyavacCedaH         7               15             22
svAtantryAbhAvasamucchinnakartRtvanirAsaH   8               19             25

Visual Comparison

Input: Independent-knowledge-direct-realization-capacity

  • Panini: ▁nirapekza | jYAna | sAkzAtkAra | sAman | arthy | am (6 meaningful roots)
  • Sanskrit-BERT: nirape | ##k | ##z | ##a | ##jya | ##nas... (14 noise fragments)

📋 Use Cases

  • 🔍 Sanskrit semantic search
  • 📖 QA over philosophical texts (Vedanta, Nyaya, etc.)
  • 📜 Long-form verse processing (epics, puranas)
  • 🤖 Training Sanskrit LLMs with cleaner token streams
  • 🔬 Linguistics research & morphological analysis

🛠️ Technical Details

  • Architecture: Recursive Descent Splitter + Kosha (Dictionary) Lookup.
  • Vocab Size: 128,000.
  • Fallback: Deterministic character-level splitting, used only when the grammar-based parse fails.
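The architecture above — recursive descent over candidate prefixes, dictionary lookup, character-level fallback — can be sketched as follows. The lexicon and the longest-match-first strategy are assumptions for illustration, not the shipped implementation:

```python
# Minimal sketch of a recursive-descent splitter with dictionary (kosha)
# lookup. Falls back to single characters only when no full parse exists.
# The toy lexicon below is an assumption, not the real 128k vocabulary.

LEXICON = {"nirapekza", "jYAna", "sAkzAtkAra", "sAmarthyam"}

def parse(text):
    """Return a list of dictionary tokens covering text, or None."""
    if not text:
        return []
    # try longer dictionary matches first, backtracking on failure
    for end in range(len(text), 0, -1):
        head = text[:end]
        if head in LEXICON:
            rest = parse(text[end:])
            if rest is not None:
                return [head] + rest
    return None

def tokenize(text):
    tokens = parse(text)
    # deterministic fallback: character-level when the grammar parse fails
    return tokens if tokens is not None else list(text)

print(tokenize("nirapekzajYAnasAkzAtkArasAmarthyam"))
```

Backtracking matters here: a greedy match with no recursion can strand the tail of a compound, whereas the recursive parse only commits to a prefix if the remainder also parses.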

📜 Citation

@misc{panini2025,
  author = {ArthaLabs},
  title = {Panini Tokenizer: Grammar-First Sanskrit Tokenization},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ArthaLabs/panini-tokenizer}}
}

License

Apache 2.0
