🛡️ SindhiNLTK v1.0

A Morphology-Aware, Neural-Hybrid NLP Toolkit for Sindhi.

Developed by Aakash Meghwar (Founder, Text Tech Solutions). SindhiNLTK is a high-performance framework designed to eliminate "Subword Shattering" and the "Token Tax" in Sindhi Language Processing.


📊 Evaluation & Benchmarks

The toolkit was refined and validated against a corpus of 43,784 Sindhi SFT (Supervised Fine-Tuning) instruction samples.

1. Tokenization Efficiency (Fertility Rate)

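Fertility Rate (FR) measures the average number of tokens a tokenizer generates per word. A lower FR means words are split into fewer pieces, which keeps tokens closer to meaningful units and lowers computational cost; an FR near 1.0 means most words survive as single tokens.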

| Model | Avg. Fertility (Sindhi) | Aspiration Integrity | Morphology Aware? |
|---|---|---|---|
| mBERT (Google) | 3.82 | Low (Shatters) | No |
| Llama-3 (Meta) | 4.15 | Medium | No |
| GPT-4 (OpenAI) | 3.50 | Medium | No |
| SindhiNLTK (Ours) | 1.15 | 100% (Protected) | Yes |
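
The fertility rate defined above can be computed for any tokenizer with a few lines of Python. This is a minimal sketch, not the script used to produce the table: `fertility_rate`, `whole_word`, and `char_level` are hypothetical names, words are assumed to be whitespace-delimited, and the two toy tokenizers only mark the best and worst cases.

```python
def fertility_rate(texts, tokenize):
    """Average number of tokens produced per whitespace-delimited word."""
    total_words = 0
    total_tokens = 0
    for text in texts:
        for word in text.split():
            total_words += 1
            total_tokens += len(tokenize(word))
    if total_words == 0:
        raise ValueError("no words to measure")
    return total_tokens / total_words


# Toy tokenizers for illustration only (assumptions, not real models).
def whole_word(word):
    """Best case: every word stays a single token."""
    return [word]


def char_level(word):
    """Worst case: every word falls back to individual characters."""
    return list(word)


sample = ["سنڌي ٻولي تمام مٺي ۽ خوبصورت آهي"]
print(fertility_rate(sample, whole_word))
print(fertility_rate(sample, char_level))
```

To measure a real tokenizer, pass its own word-to-tokens function as `tokenize`; the result is only comparable across tokenizers when the same corpus and the same word-splitting rule are used.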

2. Core Parameters

  • Fertility Rate: 1.15, versus 3.50 to 4.15 for the subword tokenizers benchmarked above. mBERT, for example, emits about 230% more tokens per word (3.82 / 1.15 ≈ 3.3×).
  • Aspiration Integrity: 100% preservation of clusters (گھ، جھ، کھ، etc.) via the V3 Morphological Shield.
  • Context Optimization: By emitting fewer tokens per word, SindhiNLTK lets models fit more Sindhi text into the same context window and memory budget.
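
The effect of fertility on context length is simple arithmetic: the number of words that fit in a window is roughly the token budget divided by the fertility rate. The sketch below uses the fertility figures from the table above; the 8,192-token budget and the helper name `words_per_context` are assumptions chosen purely for illustration.

```python
# Fertility rates as reported in the benchmark table.
FERTILITY = {
    "mBERT": 3.82,
    "Llama-3": 4.15,
    "GPT-4": 3.50,
    "SindhiNLTK": 1.15,
}


def words_per_context(token_budget, fertility):
    """Approximate number of words that fit in a fixed token budget."""
    return int(token_budget / fertility)


TOKEN_BUDGET = 8192  # hypothetical context window, not a figure from this project

for name, fr in FERTILITY.items():
    ratio = fr / FERTILITY["SindhiNLTK"]
    print(f"{name}: ~{words_per_context(TOKEN_BUDGET, fr)} words, "
          f"{ratio:.2f}x the tokens per word of SindhiNLTK")
```

Under these assumptions the same window holds roughly three to three and a half times as many Sindhi words as with the baseline tokenizers.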

🚀 Key Features

  • Linguistic Shielding: A hybrid Regex-BPE architecture that prevents character-level fallback for the 52-letter Sindhi alphabet.
  • Instruction-Tuned Logic: Optimized for processing complex prompts and SFT datasets.
  • Neural Sentiment Brain: Context-aware sentiment classification using a transformer-based backbone.
  • Morphological Stemmer: Rule-based suffix stripping tailored for Sindhi noun and verb forms.
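
The idea behind Linguistic Shielding can be sketched as a regex pre-tokenization step that treats aspirated clusters as indivisible units before any subword merging happens. This is not SindhiNLTK's actual implementation: the pattern, the function name `shielded_units`, and the assumption that each cluster is a base consonant followed by ھ (U+06BE) are all illustrative.

```python
import re

# Assumption: each aspirated cluster is a base consonant (U+06AF, U+062C or
# U+06A9) followed by the aspiration marker U+06BE. Extend the set as needed.
ASPIRATED_CLUSTER = "[\u06AF\u062C\u06A9]\u06BE"

# Try to match a protected cluster first; otherwise fall back to one character.
UNIT_PATTERN = re.compile(ASPIRATED_CLUSTER + "|.", re.DOTALL)


def shielded_units(word):
    """Split a word into atomic units without ever separating a cluster."""
    return UNIT_PATTERN.findall(word)


word = "\u06AF\u06BE\u0631"  # three code points: U+06AF, U+06BE, U+0631
units = shielded_units(word)
print(len(word), len(units))  # 3 code points become 2 units
```

If units like these serve as the base symbols for BPE training, merges can only combine whole clusters, so a cluster cannot be split at the character level.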

⚖️ License

Licensed under the MIT License.

👤 About the Author

Aakash Meghwar is a Computational Linguist specializing in the digital evolution of South Asian languages.

  • 🎓 M.S. in Applied Linguistics & Text Analytics (HSE, Russia - Graduating June 2026)
  • 🎓 B.S. in English Language & Literature (NUML, Islamabad)
  • 💡 Founder: Text Tech Solutions
  • ✍️ Published Researcher: Author of "Compact Transformer Models for Classical Urdu Poetry" (Corporum Journal).

🤝 Open for PhD Opportunities & Collaboration

I am actively seeking PhD opportunities and Research Collaborations in:

  • Low-Resource NLP: Efficient modeling for Sindhi, Urdu, and Siraiki.
  • Model Compression: Knowledge distillation and MiniLLMs for South Asian languages.
  • SindhiLM Evolution: Developing next-generation, morphology-aware language models.

🛠️ Core Projects & Research

  • SindhiFormer: A specialized transformer series for Sindhi script and syntax.
  • AuratMarch MiniLLM: Research into lightweight, socially-nuanced models for low-resource contexts.
  • Sindhi SFT Datasets: Development and curation of high-quality instruction-following data.
  • Morphology-Aware Tokenizers: Custom BPE engines designed for South Asian orthography.

Contact: aakashmeghwar01@gmail.com | LinkedIn


💻 Quick Start

from sindhinltk import SindhiNLP

# Load the pipeline (tokenizer and sentiment model)
nlp = SindhiNLP()
text = "سنڌي ٻولي تمام مٺي ۽ خوبصورت آهي"

# Run the full pipeline on the input text
result = nlp.process(text)

print(result['tokens']) # ['سنڌي', 'ٻولي', 'تمام', 'مٺي', '۽', 'خوبصورت', 'آهي']
print(result['sentiment']) # {'label': 'Positive', 'confidence': '54.11%'}
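
In the example output above the confidence is a percentage string, so downstream code may want to convert it to a number before acting on the label. A minimal sketch, assuming the result shape shown in the Quick Start; `parse_confidence`, `is_confident`, and the 0.75 threshold are hypothetical and not part of the SindhiNLTK API.

```python
def parse_confidence(value):
    """Convert a percentage string such as '54.11%' to a float in [0, 1]."""
    return float(value.strip().rstrip("%")) / 100.0


def is_confident(sentiment, threshold=0.75):
    """Return True when the reported confidence meets the chosen threshold."""
    return parse_confidence(sentiment["confidence"]) >= threshold


# Same shape as the Quick Start output.
sentiment = {"label": "Positive", "confidence": "54.11%"}
print(is_confident(sentiment))  # False
```

A prediction at about 54% is only slightly above chance for a two-class decision, so treating it as uncertain rather than Positive may be the safer choice.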
