Panini Tokenizer

The first grammar-first Sanskrit tokenizer based on Pāṇinian morphological analysis.

Demo

Why it matters: Fewer tokens = more usable context per input = better learning & longer text coverage.

🚨 The Problem

Statistical tokenizers (BPE/WordPiece) systematically underperform on Sanskrit because they do not model Sandhi (phonetic fusion).

  • Standard Models (BERT/Qwen): fracture complex words into phonetic noise (##k, ##z, ##ab).
  • Panini Tokenizer: uses recursive morphological parsing to recover the original semantic roots (nirapekza + jYAna).

⚡ Key Features

  • 🔀 Vocab: 128k dictionary-backed tokens (Monier-Williams).
  • 🔄 Sandhi Reversal: Automatically splits fused compounds (e.g., t → d, i → y).
  • 🧩 Semantic Atomicism: Preserves complex philosophical concepts as single tokens. This aligns token boundaries with linguistic meaning, reducing gradient noise during training.
  • 📉 Efficiency: Reduces token count by 2-4x compared to multilingual models.
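The sandhi-reversal idea above can be sketched in a few lines: try undoing a sound change at each candidate split point and accept the split only if both halves are dictionary words. The two-rule table and tiny lexicon here are illustrative assumptions, not the tokenizer's actual rules or data:

```python
# Minimal sketch of sandhi reversal on SLP1 text. The reversal table and
# lexicon below are toy examples, not the Panini Tokenizer's real data.

SANDHI_REVERSALS = {
    "d": ["t"],   # a final t may surface as d before a voiced sound
    "y": ["i"],   # a final i may surface as y before a vowel
}

LEXICON = {"tat", "asti", "vidyA", "jYAna"}  # toy dictionary (kosha)

def split_sandhi(text):
    """Return (left, right) if text splits into two dictionary words
    after undoing one sandhi change at the boundary, else None."""
    for i in range(1, len(text)):
        left, right = text[:i], text[i:]
        # candidate restorations of the left word's final sound
        candidates = [left] + [left[:-1] + orig
                               for orig in SANDHI_REVERSALS.get(left[-1], [])]
        for cand in candidates:
            if cand in LEXICON and right in LEXICON:
                return cand, right
    return None

print(split_sandhi("tadasti"))  # ('tat', 'asti'): d at the boundary reverts to t
```

Real sandhi involves many more rules (vowel coalescence, visarga changes, internal sandhi), but the structure — propose a reversal, validate both halves against the dictionary — is the same.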

🚀 Quick Start

No custom installation is required; use it directly with Hugging Face transformers. Note: the model expects SLP1 transliteration (e.g., vidyA), not Devanagari.

from transformers import AutoTokenizer

# Load with trust_remote_code=True because of custom logic
tokenizer = AutoTokenizer.from_pretrained(
    "ArthaLabs/panini-tokenizer",
    trust_remote_code=True
)

# Tokenize complex Sandhi compounds (SLP1 input)
text = "nirapekzajYAnasAkzAtkArasAmarthyam"
tokens = tokenizer.tokenize(text)

print(tokens)

📊 Benchmarks: The "Context Dividend"

By strictly adhering to grammar, Panini Tokenizer drastically reduces sequence length, effectively tripling the context window for downstream tasks.

Input Compound                              Panini (Ours)   Google MuRIL   Qwen2
nirapekzajYAnasAkzAtkArasAmarthyam          6               18             25
tadekaniScitArthavyavasthApanam             6               13             18
svaprakASatvaparaprakASavyavacCedaH         7               15             22
svAtantryAbhAvasamucchinnakartRtvanirAsaH   8               19             25

Visual Comparison

Input: Independent-knowledge-direct-realization-capacity

  • Panini: ▁nirapekza | jYAna | sAkzAtkAra | sAman | arthy | am (6 meaningful roots)
  • Sanskrit-BERT: nirape | ##k | ##z | ##a | ##jya | ##nas... (14 noise fragments)

📋 Use Cases

  • 🔍 Sanskrit semantic search
  • 📖 QA over philosophical texts (Vedanta, Nyaya, etc.)
  • 📜 Long-form verse processing (epics, puranas)
  • 🤖 Training Sanskrit LLMs with cleaner token streams
  • 🔬 Linguistics research & morphological analysis

🛠️ Technical Details

  • Architecture: Recursive Descent Splitter + Kosha (Dictionary) Lookup.
  • Vocab Size: 128,000.
  • Fallback: Deterministic character-level splitting, used only when the grammar-based parse fails.
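The architecture above — recursive descent over candidate prefixes, dictionary lookup, character-level fallback — can be sketched as follows. The lexicon and the longest-match-first strategy are assumptions for illustration, not the shipped implementation:

```python
# Minimal sketch of a recursive-descent splitter with dictionary (kosha)
# lookup. Falls back to single characters only when no full parse exists.
# The toy lexicon below is an assumption, not the real 128k vocabulary.

LEXICON = {"nirapekza", "jYAna", "sAkzAtkAra", "sAmarthyam"}

def parse(text):
    """Return a list of dictionary tokens covering text, or None."""
    if not text:
        return []
    # try longer dictionary matches first, backtracking on failure
    for end in range(len(text), 0, -1):
        head = text[:end]
        if head in LEXICON:
            rest = parse(text[end:])
            if rest is not None:
                return [head] + rest
    return None

def tokenize(text):
    tokens = parse(text)
    # deterministic fallback: character-level when the grammar parse fails
    return tokens if tokens is not None else list(text)

print(tokenize("nirapekzajYAnasAkzAtkArasAmarthyam"))
```

Backtracking matters here: a greedy match with no recursion can strand the tail of a compound, whereas the recursive parse only commits to a prefix if the remainder also parses.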

📜 Citation

@misc{panini2025,
  author = {ArthaLabs},
  title = {Panini Tokenizer: Grammar-First Sanskrit Tokenization},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ArthaLabs/panini-tokenizer}}
}

License

Apache 2.0
