Update model card with Session 2 results
README.md CHANGED
@@ -11,14 +11,14 @@ tags:
-BERT-style model trained from scratch on Sindhi text.
-| Session | Data | Epochs | Perplexity |
-|---|---|---|---|
-| Session 1 | 500K lines | 5 | 78.10 |
-| Session 2 |
@@ -26,14 +26,58 @@ BERT-style model trained from scratch on Sindhi text.
-- [x]
-- [x] Session
-- [
# Sindhi-BERT-base

The first BERT-style language model trained from scratch on Sindhi text, using a custom Sindhi BPE tokenizer with 32,000 pure Sindhi tokens.

## Training History

| Session | Data | Epochs | Perplexity | Time |
|---|---|---|---|---|
| Session 1 | 500K lines | 5 | 78.10 | 301 min |
| Session 2 | 1.5M lines | 3 | 41.62 | 359 min |
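Perplexity here is presumably (an assumption; the card does not say how it was computed) the exponential of the mean masked-LM eval loss, which is the standard way the metric is reported. A quick sketch of the relationship behind the table's numbers:

```python
import math

def perplexity(eval_loss: float) -> float:
    """Perplexity = exp(mean cross-entropy eval loss)."""
    return math.exp(eval_loss)

# Working backwards from the table: Session 1's perplexity of 78.10
# corresponds to an eval loss of about 4.36, and Session 2's 41.62 to
# about 3.73 -- a solid drop for 3 more epochs on 3x the data.
print(round(math.log(78.10), 2))  # 4.36
print(round(math.log(41.62), 2))  # 3.73
```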
## Model Details

|---|---|
| Architecture | RoBERTa-base |
| Vocabulary | 32,000 tokens (pure Sindhi BPE) |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Max length | 512 tokens |
| Parameters | ~125M |
| Language | Sindhi (sd) |
| License | MIT |
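A back-of-the-envelope check of the parameter count from the shape above (weight matrices only; biases and LayerNorms ignored, and the 514 position embeddings are the usual RoBERTa value, assumed here) lands near 110M. The canonical ~125M RoBERTa-base figure assumes its 50K vocabulary, so with a 32K vocabulary the true count should be somewhat lower:

```python
def approx_params(vocab=32_000, hidden=768, layers=12, ffn=3072, max_pos=514):
    """Rough weight count for the RoBERTa-base shape in the table above
    (embedding matrices plus per-layer attention and FFN weights only)."""
    embeddings = vocab * hidden + max_pos * hidden  # token + position
    attention = 4 * hidden * hidden                 # Q, K, V, output projections
    ffn_block = 2 * hidden * ffn                    # up + down projections
    return embeddings + layers * (attention + ffn_block)

print(f"{approx_params() / 1e6:.0f}M")  # 110M
```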
## Fill-Mask Quality

| Session | Score |
|---|---|
| Session 1 | 50% (5/10) |
| Session 2 | 70% (7/10) |
## Fill-Mask Examples (Session 2)

| Input | Top Prediction | Confidence | Quality |
|---|---|---|---|
| پاڪستان ۾ سنڌي ___ گهڻي تعداد | ماڻهو (people) | 49.78% | Perfect |
| سنڌي ___ دنيا جي قديم ٻولين | ٻولي (language) | 22.25% | Perfect |
| شاهه لطيف سنڌي ___ جو وڏو شاعر | شاعريءَ (poetry) | 22.22% | Perfect |
| استاد شاگردن کي ___ سيکاري | تعليم (education) | 10.61% | Good |
| ڪراچي سنڌ جو سڀ کان وڏو ___ | شهر (city) | 9.04% | Perfect |
| سنڌ جي ___ ڏاڍي پراڻي آهي | تاريخ (history) | 7.48% | Perfect |
| دنيا ___ گھڻي مصروف آھي | ۾ (in) | 38.99% | Perfect |
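The Confidence column is the softmax probability the model assigns to its top candidate at the masked position. A minimal sketch of that final step, with made-up logits standing in for real model output:

```python
import math

def softmax(logits: dict) -> dict:
    """Turn per-token logits into the probabilities shown above."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for the masked slot in "ڪراچي سنڌ جو سڀ کان وڏو ___":
probs = softmax({"شهر": 2.1, "علائقو": 1.2, "ماڻهو": 0.4})
best = max(probs, key=probs.get)
print(best, f"{probs[best]:.2%}")
```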
## Tokenizer

Custom Sindhi BPE tokenizer — every Sindhi word stays as ONE token:

    Input  : سنڌي ٻولي دنيا جي قديم
    Tokens : ['سنڌي', 'ٻولي', 'دنيا', 'جي', 'قديم']
    Count  : 5 words = 5 tokens
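One way to verify the one-word-one-token claim is a fertility check: average tokens produced per whitespace word, which should be exactly 1.0 for in-vocabulary Sindhi text (multilingual tokenizers typically score well above that on Sindhi). In this sketch `str.split` stands in for the real BPE tokenizer, which ships with the model rather than with this card:

```python
def fertility(text: str, tokenize) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    return len(tokenize(text)) / len(words)

# With a tokenizer that keeps every word whole, fertility is exactly 1.0:
print(fertility("سنڌي ٻولي دنيا جي قديم", tokenize=str.split))  # 1.0
```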
## Comparison With Other Models

| Model | Type | Perplexity | Fill-mask |
|---|---|---|---|
| mBERT fine-tuned | Multilingual | 4.19 | Poor — predicts punctuation |
| XLM-R fine-tuned | Multilingual | 5.88 | 80% correct |
| Sindhi-BERT S1 | Sindhi only | 78.10 | 50% |
| Sindhi-BERT S2 | Sindhi only | 41.62 | 70% |
## Roadmap

- [x] Custom Sindhi BPE tokenizer (32K vocab)
- [x] Session 1 — 500K lines, 5 epochs, perplexity 78
- [x] Session 2 — 1.5M lines, 3 epochs, perplexity 41
- [ ] Session 3 — new data + full corpus
- [ ] Session 4 — lower LR, fine-tuning
- [ ] Spell checker fine-tuning
- [ ] Next word prediction
- [ ] Sindhi chatbot