🛡️ SindhiNLTK v1.0
A Morphology-Aware, Neural-Hybrid NLP Toolkit for Sindhi.
Developed by Aakash Meghwar (Founder, Text Tech Solutions). SindhiNLTK is a high-performance framework designed to eliminate "Subword Shattering" (words broken into many small subword fragments) and the resulting "Token Tax" (inflated token counts) in Sindhi language processing.
📊 Evaluation & Benchmarks
The toolkit was refined and validated against a corpus of 43,784 Sindhi SFT (Supervised Fine-Tuning) instruction samples.
1. Tokenization Efficiency (Fertility Rate)
Fertility Rate (FR) measures the average number of tokens a tokenizer generates per word. A lower FR means words are split into fewer pieces, which keeps tokens closer to whole words and lowers computational cost.
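As a rough illustration of the metric, the sketch below computes a fertility rate for any tokenizer passed in as a function. It assumes words are delimited by whitespace, which may differ from how the benchmark was run, and the two stand-in tokenizers are hypothetical helpers, not part of SindhiNLTK.

```python
def fertility_rate(sentences, tokenize):
    """Average number of tokens produced per whitespace-delimited word.

    `tokenize` is any callable mapping one word to a list of tokens.
    Whitespace word splitting is an assumption of this sketch.
    """
    total_words = 0
    total_tokens = 0
    for sentence in sentences:
        words = sentence.split()
        total_words += len(words)
        total_tokens += sum(len(tokenize(word)) for word in words)
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


# Hypothetical stand-in tokenizers, used only to exercise the metric.
def whole_word_tokenizer(word):
    return [word]          # best case: one token per word


def char_tokenizer(word):
    return list(word)      # worst case: one token per character


corpus = ["سنڌي ٻولي تمام مٺي ۽ خوبصورت آهي"]
print(fertility_rate(corpus, whole_word_tokenizer))  # 1.0
print(fertility_rate(corpus, char_tokenizer))        # greater than 1.0
```

A tokenizer that never splits a word scores exactly 1.0, so values close to 1 indicate that most words survive as single tokens.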
| Model | Avg. Fertility (Sindhi) | Aspiration Integrity | Morphology Aware? |
|---|---|---|---|
| mBERT (Google) | 3.82 | Low (Shatters) | No |
| Llama-3 (Meta) | 4.15 | Medium | No |
| GPT-4 (OpenAI) | 3.50 | Medium | No |
| SindhiNLTK (Ours) | 1.15 | 100% (Protected) | Yes |
2. Core Parameters
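- Fertility Rate: 1.15, compared with 3.50 to 4.15 for the baseline tokenizers in the table above. That is roughly 3.0 to 3.6 times fewer tokens per word (a 67% to 72% reduction); equivalently, the baselines emit about 204% to 261% more tokens.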
- Aspiration Integrity: 100% preservation of clusters (گھ، جھ، کھ، etc.) via the V3 Morphological Shield.
- Context Optimization: At the fertility rates reported above, a fixed context window holds roughly three times as many Sindhi words as with the baseline tokenizers, so models can process longer inputs within the same memory limits.
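The relationship between fertility and usable context can be checked with simple arithmetic on the table's figures. The sketch below is only an estimate based on average fertility; the 8192-token window is an illustrative assumption, not a property of any model listed here.

```python
# Fertility rates copied from the benchmark table.
BASELINES = {"mBERT": 3.82, "Llama-3": 4.15, "GPT-4": 3.50}
SINDHINLTK_FR = 1.15


def token_ratio(baseline_fr, ours_fr):
    """How many times more tokens the baseline emits per word."""
    return baseline_fr / ours_fr


def token_reduction(baseline_fr, ours_fr):
    """Fraction of tokens saved relative to the baseline."""
    return 1 - ours_fr / baseline_fr


def words_in_context(window_tokens, fertility):
    """Approximate number of words that fit in a token window."""
    return int(window_tokens // fertility)


WINDOW = 8192  # hypothetical context size, chosen only for illustration

for name, fr in BASELINES.items():
    print(
        f"{name}: {token_ratio(fr, SINDHINLTK_FR):.2f}x tokens, "
        f"{token_reduction(fr, SINDHINLTK_FR):.0%} saved, "
        f"{words_in_context(WINDOW, fr)} vs "
        f"{words_in_context(WINDOW, SINDHINLTK_FR)} words per window"
    )
```

Because fertility is an average, the per-window word counts are approximations; real documents will vary with vocabulary and punctuation.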
🚀 Key Features
- Linguistic Shielding: A hybrid Regex-BPE architecture that prevents character-level fallback for the 52-letter Sindhi alphabet.
- Instruction-Tuned Logic: Optimized for processing complex prompts and SFT datasets.
- Neural Sentiment Brain: Context-aware sentiment classification using a transformer-based backbone.
- Morphological Stemmer: Rule-based suffix stripping tailored for Sindhi noun and verb forms.
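One way to picture the linguistic shielding idea is a regex pre-segmentation step that treats aspirated clusters as indivisible units before any BPE merges are learned. The sketch below is a hypothetical illustration, not the toolkit's actual V3 Morphological Shield: the cluster list covers only the three examples named above, and it assumes the aspirate is encoded as U+06BE.

```python
import re

# Aspirated clusters from the examples above, written as escapes so the
# code points are unambiguous (assumption: the aspirate is U+06BE).
ASPIRATED_CLUSTERS = [
    "\u06AF\u06BE",  # gaf + do-chashmee he
    "\u062C\u06BE",  # jeem + do-chashmee he
    "\u06A9\u06BE",  # keheh + do-chashmee he
]

# Try each protected cluster first, then fall back to a single character.
_UNIT = re.compile(
    "|".join(re.escape(c) for c in ASPIRATED_CLUSTERS) + "|.",
    re.DOTALL,
)


def shielded_units(word):
    """Split a word into base units, keeping protected clusters intact."""
    return _UNIT.findall(word)


# gaf + he + reh: the cluster stays together, the final letter stands alone.
units = shielded_units("\u06AF\u06BE\u0631")
print(len(units))  # 2
```

A BPE trainer fed these units as its starting symbols could merge them into larger tokens but would have no way to split a protected cluster, which is the behaviour the feature description suggests.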
⚖️ License
Licensed under the MIT License.
👤 About the Author
Aakash Meghwar is a Computational Linguist specializing in the digital evolution of South Asian languages.
- 🎓 M.S. in Applied Linguistics & Text Analytics (HSE, Russia - Graduating June 2026)
- 🎓 B.S. in English Language & Literature (NUML, Islamabad)
- 💡 Founder: Text Tech Solutions
- ✍️ Published Researcher: Author of "Compact Transformer Models for Classical Urdu Poetry" (Corporum Journal).
🤝 Open for PhD Opportunities & Collaboration
I am actively seeking PhD opportunities and research collaborations in:
- Low-Resource NLP: Efficient modeling for Sindhi, Urdu, and Siraiki.
- Model Compression: Knowledge distillation and MiniLLMs for South Asian languages.
- SindhiLM Evolution: Developing next-generation, morphology-aware language models.
🛠️ Core Projects & Research
- SindhiFormer: A specialized transformer series for Sindhi script and syntax.
- AuratMarch MiniLLM: Research into lightweight, socially-nuanced models for low-resource contexts.
- Sindhi SFT Datasets: Development and curation of high-quality instruction-following data.
- Morphology-Aware Tokenizers: Custom BPE engines designed for South Asian orthography.
Contact: aakashmeghwar01@gmail.com | LinkedIn
💻 Quick Start
from sindhinltk import SindhiNLP
nlp = SindhiNLP()
text = "سنڌي ٻولي تمام مٺي ۽ خوبصورت آهي"
result = nlp.process(text)
print(result['tokens']) # ['سنڌي', 'ٻولي', 'تمام', 'مٺي', '۽', 'خوبصورت', 'آهي']
print(result['sentiment']) # {'label': 'Positive', 'confidence': '54.11%'}