Update model card with Session 2 results
README.md CHANGED
@@ -11,14 +11,14 @@ tags:
-BERT-style model trained from scratch on Sindhi text.
-| Session | Data | Epochs | Perplexity |
-|---|---|---|---|
-| Session 1 | 500K lines | 5 | 78.10 |
-| Session 2 |
@@ -26,14 +26,58 @@ BERT-style model trained from scratch on Sindhi text.
-- [x]
-- [x] Session
-- [
# Sindhi-BERT-base

The first BERT-style language model trained from scratch on Sindhi text, using a custom Sindhi BPE tokenizer with 32,000 pure Sindhi tokens.

## Training History

| Session | Data | Epochs | Perplexity | Time |
|---|---|---|---|---|
| Session 1 | 500K lines | 5 | 78.10 | 301 min |
| Session 2 | 1.5M lines | 3 | 41.62 | 359 min |
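Perplexity here is presumably (an assumption; the card does not say how it was computed) the exponential of the mean masked-LM eval loss, which is the standard way the metric is reported. A quick sketch of the relationship behind the table's numbers:

```python
import math

def perplexity(eval_loss: float) -> float:
    """Perplexity = exp(mean cross-entropy eval loss)."""
    return math.exp(eval_loss)

# Working backwards from the table: Session 1's perplexity of 78.10
# corresponds to an eval loss of about 4.36, and Session 2's 41.62 to
# about 3.73 -- a solid drop for 3 more epochs on 3x the data.
print(round(math.log(78.10), 2))  # 4.36
print(round(math.log(41.62), 2))  # 3.73
```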
## Model Details

|---|---|
| Architecture | RoBERTa-base |
| Vocabulary | 32,000 tokens (pure Sindhi BPE) |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Max length | 512 tokens |
| Parameters | ~125M |
| Language | Sindhi (sd) |
| License | MIT |
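A back-of-the-envelope check of the parameter count from the shape above (weight matrices only; biases and LayerNorms ignored, and the 514 position embeddings are the usual RoBERTa value, assumed here) lands near 110M. The canonical ~125M RoBERTa-base figure assumes its 50K vocabulary, so with a 32K vocabulary the true count should be somewhat lower:

```python
def approx_params(vocab=32_000, hidden=768, layers=12, ffn=3072, max_pos=514):
    """Rough weight count for the RoBERTa-base shape in the table above
    (embedding matrices plus per-layer attention and FFN weights only)."""
    embeddings = vocab * hidden + max_pos * hidden  # token + position
    attention = 4 * hidden * hidden                 # Q, K, V, output projections
    ffn_block = 2 * hidden * ffn                    # up + down projections
    return embeddings + layers * (attention + ffn_block)

print(f"{approx_params() / 1e6:.0f}M")  # 110M
```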
## Fill-Mask Quality

| Session | Score |
|---|---|
| Session 1 | 50% (5/10) |
| Session 2 | 70% (7/10) |
## Fill-Mask Examples (Session 2)

| Input | Top Prediction | Confidence | Quality |
|---|---|---|---|
| پاڪستان ۾ سنڌي ___ گهڻي تعداد | ماڻهو (people) | 49.78% | Perfect |
| سنڌي ___ دنيا جي قديم ٻولين | ٻولي (language) | 22.25% | Perfect |
| شاهه لطيف سنڌي ___ جو وڏو شاعر | شاعريءَ (poetry) | 22.22% | Perfect |
| استاد شاگردن کي ___ سيکاري | تعليم (education) | 10.61% | Good |
| ڪراچي سنڌ جو سڀ کان وڏو ___ | شهر (city) | 9.04% | Perfect |
| سنڌ جي ___ ڏاڍي پراڻي آهي | تاريخ (history) | 7.48% | Perfect |
| دنيا ___ گھڻي مصروف آھي | ۾ (in) | 38.99% | Perfect |
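The Confidence column is the softmax probability the model assigns to its top candidate at the masked position. A minimal sketch of that final step, with made-up logits standing in for real model output:

```python
import math

def softmax(logits: dict) -> dict:
    """Turn per-token logits into the probabilities shown above."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for the masked slot in "ڪراچي سنڌ جو سڀ کان وڏو ___":
probs = softmax({"شهر": 2.1, "علائقو": 1.2, "ماڻهو": 0.4})
best = max(probs, key=probs.get)
print(best, f"{probs[best]:.2%}")
```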
## Tokenizer

Custom Sindhi BPE tokenizer — every Sindhi word stays as ONE token:

    Input  : سنڌي ٻولي دنيا جي قديم
    Tokens : ['سنڌي', 'ٻولي', 'دنيا', 'جي', 'قديم']
    Count  : 5 words = 5 tokens
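One way to verify the one-word-one-token claim is a fertility check: average tokens produced per whitespace word, which should be exactly 1.0 for in-vocabulary Sindhi text (multilingual tokenizers typically score well above that on Sindhi). In this sketch `str.split` stands in for the real BPE tokenizer, which ships with the model rather than with this card:

```python
def fertility(text: str, tokenize) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    return len(tokenize(text)) / len(words)

# With a tokenizer that keeps every word whole, fertility is exactly 1.0:
print(fertility("سنڌي ٻولي دنيا جي قديم", tokenize=str.split))  # 1.0
```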
## Comparison With Other Models

| Model | Type | Perplexity | Fill-mask |
|---|---|---|---|
| mBERT fine-tuned | Multilingual | 4.19 | Poor — predicts punctuation |
| XLM-R fine-tuned | Multilingual | 5.88 | 80% correct |
| Sindhi-BERT S1 | Sindhi only | 78.10 | 50% |
| Sindhi-BERT S2 | Sindhi only | 41.62 | 70% |
## Roadmap

- [x] Custom Sindhi BPE tokenizer (32K vocab)
- [x] Session 1 — 500K lines, 5 epochs, perplexity 78
- [x] Session 2 — 1.5M lines, 3 epochs, perplexity 41
- [ ] Session 3 — new data + full corpus
- [ ] Session 4 — lower LR, fine-tuning
- [ ] Spell checker fine-tuning
- [ ] Next word prediction
- [ ] Sindhi chatbot