[](https://huggingface.co/spaces/ArthaLabs/panini-tokenizer-demo)

> **Why it matters:** *Fewer tokens = more usable context per input = better learning & longer text coverage.*

## 🚨 The Problem

Statistical tokenizers (BPE/WordPiece) systematically underperform on Sanskrit because they do not model **Sandhi** (phonetic fusion).
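For example, under vowel sandhi `rāma` + `iti` fuses into `rāmeti` (a + i → e), so the boundary between the two words vanishes from the surface string. A minimal sketch of this fusion — a toy rule table for illustration only, not the tokenizer's actual rule engine:

```python
# Toy illustration of external vowel sandhi: when two words meet,
# their boundary vowels fuse, erasing the word boundary from the text.
# Didactic sketch only — not the Panini Tokenizer's implementation.

SANDHI_RULES = {
    ("a", "i"): "e",   # guna sandhi:           a + i -> e
    ("a", "u"): "o",   # guna sandhi:           a + u -> o
    ("a", "a"): "ā",   # savarna-dirgha sandhi: a + a -> ā
}

def fuse(left: str, right: str) -> str:
    """Join two words, applying a vowel sandhi rule at the boundary if one matches."""
    key = (left[-1], right[0])
    if key in SANDHI_RULES:
        return left[:-1] + SANDHI_RULES[key] + right[1:]
    return left + right

print(fuse("rāma", "iti"))   # rāmeti — the rāma/iti seam is gone
print(fuse("na", "asti"))    # nāsti
```

Because BPE/WordPiece merges are learned from surface frequencies, the fused form `rāmeti` no longer contains the boundary between `rāma` and `iti`, so statistical merges fragment such words instead of splitting them at the morpheme seam.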

By strictly adhering to grammar, Panini Tokenizer drastically reduces sequence length:

* **Panini:** `▁nirapekza` | `jYAna` | `sAkzAtkAra` | `sAman` | `arthy` | `am` (6 meaningful roots)
* **Sanskrit-BERT:** `nirape` | `##k` | `##z` | `##a` | `##jya` | `##nas`... (14 noise fragments)

## 📋 Use Cases

- 🔍 **Sanskrit semantic search**
- 📖 **QA over philosophical texts** (Vedanta, Nyaya, etc.)
- 📜 **Long-form verse processing** (epics, puranas)
- 🤖 **Training Sanskrit LLMs** with cleaner token streams
- 🔬 **Linguistics research** & morphological analysis

## 🛠️ Technical Details

* **Architecture:** Recursive Descent Splitter + Kosha (Dictionary) Lookup.
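The splitter-plus-lookup idea can be sketched as follows — a toy kosha and a greedy-longest-prefix recursion that keeps the shortest segmentation. This is purely illustrative of the approach, not the shipped implementation:

```python
from functools import lru_cache

# Toy kosha: in a real system this would be a large Sanskrit dictionary
# of roots and stems. These entries are taken from the example above.
KOSHA = {"nirapekza", "jYAna", "sAkzAtkAra", "sAman", "arthy", "am"}

@lru_cache(maxsize=None)
def split(word: str):
    """Return the shortest segmentation of `word` into kosha entries, or None."""
    if not word:
        return ()
    best = None
    # Recursive descent: try every prefix found in the kosha (longest first),
    # recurse on the remainder, and keep the split with the fewest tokens.
    for i in range(len(word), 0, -1):
        prefix = word[:i]
        if prefix in KOSHA:
            tail = split(word[i:])
            if tail is not None and (best is None or 1 + len(tail) < len(best)):
                best = (prefix,) + tail
    return best

tokens = split("nirapekzajYAnasAkzAtkArasAmanarthyam")
print(tokens)
# ('nirapekza', 'jYAna', 'sAkzAtkAra', 'sAman', 'arthy', 'am')
```

Memoization (`lru_cache`) keeps the recursion linear in practice; a production splitter would additionally undo sandhi at each candidate boundary before the dictionary lookup.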