# Panini Tokenizer

The first grammar-first Sanskrit tokenizer, based on Pāṇinian morphological analysis.

**Why it matters:** fewer tokens = more usable context per input = better learning and longer text coverage.
## The Problem

Statistical tokenizers (BPE/WordPiece) systematically underperform on Sanskrit because they do not model Sandhi (phonetic fusion).
- **Standard models (BERT/Qwen):** fracture complex words into phonetic noise (`##k`, `##z`, `##ab`).
- **Panini Tokenizer:** uses recursive morphological parsing to recover the original semantic roots (`nirapekza+jYAna`).
## Key Features
- **Vocab:** 128k dictionary-backed tokens (Monier-Williams).
- **Sandhi Reversal:** automatically splits fused compounds (e.g., `t→d`, `i→y`).
- **Semantic Atomicity:** preserves complex philosophical concepts as single tokens. This aligns token boundaries with linguistic meaning, reducing gradient noise during training.
- **Efficiency:** reduces token count by 2-4x compared to multilingual models.
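The Sandhi reversal idea above can be sketched as a tiny rule-driven pass. This is an illustrative toy, not the model's actual implementation: the `SANDHI_RULES` table and `reverse_sandhi` helper are invented for the example, and cover only the two rules (`t→d`, `i→y`) mentioned above.

```python
# Toy sketch of rule-based sandhi reversal (illustrative only; the real
# tokenizer's rule set is far larger and dictionary-validated).
# Each entry maps a fused surface sound back to its underlying sound.
SANDHI_RULES = {
    "d": "t",  # t -> d before voiced sounds (tat + eka -> tadeka)
    "y": "i",  # i -> y before dissimilar vowels (iti + api -> ityapi)
}

def reverse_sandhi(surface: str, boundary: int) -> str:
    """Undo a single sandhi change at a proposed word boundary."""
    head, tail = surface[:boundary], surface[boundary:]
    if head and head[-1] in SANDHI_RULES:
        head = head[:-1] + SANDHI_RULES[head[-1]]
    return f"{head}+{tail}"

print(reverse_sandhi("tadeka", 3))  # tat+eka
print(reverse_sandhi("ityapi", 3))  # iti+api
```

In the real tokenizer, each candidate reversal is validated against the dictionary, so only splits that yield attested roots survive.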
## Quick Start
No custom installation is required. Use it directly with Hugging Face `transformers`.

**Note:** the model expects SLP1 transliteration (e.g., `vidyA`), not Devanagari.
```python
from transformers import AutoTokenizer

# Load with trust_remote_code=True because of the custom splitting logic
tokenizer = AutoTokenizer.from_pretrained(
    "ArthaLabs/panini-tokenizer",
    trust_remote_code=True,
)

# Tokenize a complex Sandhi compound (SLP1 input)
text = "nirapekzajYAnasAkzAtkArasAmarthyam"
tokens = tokenizer.tokenize(text)
print(tokens)
```
## Benchmarks: The "Context Dividend"
By strictly adhering to grammar, Panini Tokenizer drastically reduces sequence length, effectively tripling the context window for downstream tasks.
Token counts per compound:

| Input Compound | Panini (Ours) | Google MuRIL | Qwen2 |
|---|---|---|---|
| `nirapekzajYAnasAkzAtkArasAmarthyam` | 6 | 18 | 25 |
| `tadekaniScitArthavyavasthApanam` | 6 | 13 | 18 |
| `svaprakASatvaparaprakASavyavacCedaH` | 7 | 15 | 22 |
| `svAtantryAbhAvasamucchinnakartRtvanirAsaH` | 8 | 19 | 25 |
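The claimed 2-4x reduction can be checked with a few lines of arithmetic over the token counts in the table above (the `benchmarks` dict simply transcribes those rows):

```python
# Token counts copied from the benchmark table above.
benchmarks = {
    "nirapekzajYAnasAkzAtkArasAmarthyam":        {"panini": 6, "muril": 18, "qwen2": 25},
    "tadekaniScitArthavyavasthApanam":           {"panini": 6, "muril": 13, "qwen2": 18},
    "svaprakASatvaparaprakASavyavacCedaH":       {"panini": 7, "muril": 15, "qwen2": 22},
    "svAtantryAbhAvasamucchinnakartRtvanirAsaH": {"panini": 8, "muril": 19, "qwen2": 25},
}

for compound, counts in benchmarks.items():
    muril_ratio = counts["muril"] / counts["panini"]
    qwen_ratio = counts["qwen2"] / counts["panini"]
    print(f"{compound[:20]}...  MuRIL {muril_ratio:.1f}x  Qwen2 {qwen_ratio:.1f}x")
```

The per-row ratios land between roughly 2.1x and 4.2x, consistent with the 2-4x efficiency claim and the ~3x context-window gain.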
### Visual Comparison

Input: *Independent-knowledge-direct-realization-capacity*

- **Panini:** `nirapekza|jYAna|sAkzAtkAra|sAman|arthy|am` (6 meaningful roots)
- **Sanskrit-BERT:** `nirape|##k|##z|##a|##jya|##nas...` (14 noise fragments)
## Use Cases

- Sanskrit semantic search
- QA over philosophical texts (Vedanta, Nyaya, etc.)
- Long-form verse processing (epics, puranas)
- Training Sanskrit LLMs with cleaner token streams
- Linguistics research & morphological analysis
## Technical Details

- **Architecture:** recursive descent splitter + kosha (dictionary) lookup.
- **Vocab Size:** 128,000.
- **Fallback:** deterministic character-level fallback, used only when grammatical parsing fails.
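The architecture above can be sketched in a few lines: a recursive, longest-match-first splitter over a dictionary, with a character-level fallback. The tiny `KOSHA` set and the `split`/`tokenize` helpers are invented for illustration; the real tokenizer backtracks through a 128k Monier-Williams-backed lexicon and applies sandhi rules at each candidate boundary.

```python
# Toy recursive-descent splitter with kosha (dictionary) lookup and a
# deterministic character-level fallback, mirroring the architecture above.
KOSHA = {"nirapekza", "jYAna", "sAkzAtkAra", "sAmarthyam"}

def split(word):
    """Recursively split `word` into dictionary entries; None if impossible."""
    if not word:
        return []
    # Try the longest dictionary match first, backtracking on failure.
    for end in range(len(word), 0, -1):
        if word[:end] in KOSHA:
            rest = split(word[end:])
            if rest is not None:
                return [word[:end]] + rest
    return None

def tokenize(word):
    parsed = split(word)
    # Deterministic fallback: character-level only when the grammar fails.
    return parsed if parsed is not None else list(word)

print(tokenize("nirapekzajYAnasAkzAtkArasAmarthyam"))
# -> ['nirapekza', 'jYAna', 'sAkzAtkAra', 'sAmarthyam']
```

With the toy four-entry lexicon the compound resolves into four roots; the production tokenizer's finer morphological segmentation yields the six tokens reported in the benchmarks.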
## Citation

```bibtex
@misc{panini2025,
  author = {ArthaLabs},
  title = {Panini Tokenizer: Grammar-First Sanskrit Tokenization},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ArthaLabs/panini-tokenizer}}
}
```
## License

Apache 2.0