hellosindh committed · verified
Commit b9a1b96 · 1 Parent(s): 5308429

Update model card with Session 2 results

Files changed (1)
  1. README.md +52 -8
README.md CHANGED
@@ -11,14 +11,14 @@ tags:
 
 # Sindhi-BERT-base
 
- BERT-style model trained from scratch on Sindhi text.
 
 ## Training History
 
- | Session | Data | Epochs | Perplexity |
- |---|---|---|---|
- | Session 1 | 500K lines | 5 | 78.10 |
- | Session 2 | 2.1M lines | 3 | 41.62 |
 
 ## Model Details
 
@@ -26,14 +26,58 @@ BERT-style model trained from scratch on Sindhi text.
 |---|---|
 | Architecture | RoBERTa-base |
 | Vocabulary | 32,000 tokens (pure Sindhi BPE) |
 | Parameters | ~125M |
 | Language | Sindhi (sd) |
 | License | MIT |
 
 ## Roadmap
 
- - [x] Session 1 500K lines, 5 epochs
- - [x] Session 2 2.1M lines, 3 more epochs
- - [ ] Session 3 more epochs
 - [ ] Spell checker fine-tuning
 - [ ] Next word prediction
 
 # Sindhi-BERT-base
 
+ The first BERT-style language model trained from scratch on Sindhi text, using a custom Sindhi BPE tokenizer with 32,000 pure Sindhi tokens.
 
 ## Training History
 
+ | Session | Data | Epochs | Perplexity | Time |
+ |---|---|---|---|---|
+ | Session 1 | 500K lines | 5 | 78.10 | 301 min |
+ | Session 2 | 1.5M lines | 3 | 41.62 | 359 min |
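If the perplexities in the table follow the usual convention of exp(mean masked-LM cross-entropy loss) (the card does not state the convention), the implied evaluation losses can be back-computed; a minimal sketch, with loss values derived here rather than reported on the card:

```python
import math

# Masked-LM perplexity is conventionally exp(mean cross-entropy loss).
def perplexity(mean_loss: float) -> float:
    return math.exp(mean_loss)

# Back-computing the eval loss each reported perplexity implies:
session1_loss = math.log(78.10)  # ≈ 4.36
session2_loss = math.log(41.62)  # ≈ 3.73

# Round-trip check: exp(log(ppl)) recovers the reported perplexity.
assert abs(perplexity(session2_loss) - 41.62) < 1e-9
print(round(session1_loss, 2), round(session2_loss, 2))  # → 4.36 3.73
```

The drop from 78.10 to 41.62 thus corresponds to roughly 0.63 nats less loss per masked token.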
 
 ## Model Details
 
 |---|---|
 | Architecture | RoBERTa-base |
 | Vocabulary | 32,000 tokens (pure Sindhi BPE) |
+ | Hidden size | 768 |
+ | Layers | 12 |
+ | Attention heads | 12 |
+ | Max length | 512 tokens |
 | Parameters | ~125M |
 | Language | Sindhi (sd) |
 | License | MIT |
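The architecture rows match a stock RoBERTa-base shape; a hypothetical config dict reconstructed from the table (not the repo's actual config file; the 514 value is an assumption following RoBERTa's convention of reserving 2 extra position slots beyond the 512-token max length):

```python
# Reconstructed from the Model Details table; values outside it are assumptions.
sindhi_bert_config = {
    "model_type": "roberta",
    "vocab_size": 32_000,            # pure Sindhi BPE
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "max_position_embeddings": 514,  # assumption: 512 tokens + RoBERTa's 2-slot offset
}

# Per-head dimension must divide evenly: 768 / 12 = 64.
head_dim = sindhi_bert_config["hidden_size"] // sindhi_bert_config["num_attention_heads"]
print(head_dim)  # → 64
```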
 
+ ## Fill-Mask Quality
+ 
+ | Session | Score |
+ |---|---|
+ | Session 1 | 50% (5/10) |
+ | Session 2 | 70% (7/10) |
+ 
+ ## Fill-Mask Examples (Session 2)
+ 
+ | Input | Top Prediction | Confidence | Quality |
+ |---|---|---|---|
+ | پاڪستان ۾ سنڌي ___ گهڻي تعداد | ماڻهو (people) | 49.78% | Perfect |
+ | سنڌي ___ دنيا جي قديم ٻولين | ٻولي (language) | 22.25% | Perfect |
+ | شاهه لطيف سنڌي ___ جو وڏو شاعر | شاعريءَ (poetry) | 22.22% | Perfect |
+ | استاد شاگردن کي ___ سيکاري | تعليم (education) | 10.61% | Good |
+ | ڪراچي سنڌ جو سڀ کان وڏو ___ | شهر (city) | 9.04% | Perfect |
+ | سنڌ جي ___ ڏاڍي پراڻي آهي | تاريخ (history) | 7.48% | Perfect |
+ | دنيا ___ گھڻي مصروف آھي | ۾ (in) | 38.99% | Perfect |
+ 
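The prompts above use `___` as a visual blank, while a RoBERTa-style checkpoint expects its literal mask token. A minimal sketch of running them through the Transformers `fill-mask` pipeline; the repo id `hellosindh/sindhi-bert-base` and the default `<mask>` token are assumptions, not taken from the card:

```python
# Sketch: reproducing the fill-mask examples offline-safe, with the actual
# pipeline call left commented because it downloads the checkpoint.
prompts = [
    "پاڪستان ۾ سنڌي ___ گهڻي تعداد",
    "ڪراچي سنڌ جو سڀ کان وڏو ___",
]

def to_masked(prompt: str, mask_token: str = "<mask>") -> str:
    """Swap the card's visual blank for the tokenizer's literal mask token."""
    return prompt.replace("___", mask_token)

# Requires `transformers`; uncomment to query the real model
# (repo id is an assumption, substitute the one from this page):
# from transformers import pipeline
# fill = pipeline("fill-mask", model="hellosindh/sindhi-bert-base")
# for p in prompts:
#     top = fill(to_masked(p, fill.tokenizer.mask_token))[0]
#     print(p, "->", top["token_str"], f"({top['score']:.2%})")

for p in prompts:
    print(to_masked(p))
```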
+ ## Tokenizer
+ 
+ Custom Sindhi BPE tokenizer — every Sindhi word stays as ONE token:
+ 
+ Input  : سنڌي ٻولي دنيا جي قديم
+ Tokens : ['سنڌي', 'ٻولي', 'دنيا', 'جي', 'قديم']
+ Count  : 5 words = 5 tokens
+ 
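The "5 words = 5 tokens" claim is a tokenizer fertility of 1.0 (tokens emitted per whitespace-separated word); a sketch of checking it on the card's own example, with the token list copied from the card rather than recomputed:

```python
# Fertility = tokens per whitespace word; 1.0 means no word was split by BPE.
sentence = "سنڌي ٻولي دنيا جي قديم"
tokens = ["سنڌي", "ٻولي", "دنيا", "جي", "قديم"]  # output shown on the card

words = sentence.split()
fertility = len(tokens) / len(words)
print(f"{len(words)} words -> {len(tokens)} tokens (fertility {fertility:.1f})")
# → 5 words -> 5 tokens (fertility 1.0)
```

Lower fertility on in-language text is the main practical advantage a dedicated Sindhi vocabulary has over multilingual tokenizers, which tend to split such words into several subword pieces.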
+ ## Comparison With Other Models
+ 
+ | Model | Type | Perplexity | Fill-mask |
+ |---|---|---|---|
+ | mBERT fine-tuned | Multilingual | 4.19 | Poor — predicts punctuation |
+ | XLM-R fine-tuned | Multilingual | 5.88 | 80% correct |
+ | Sindhi-BERT S1 | Sindhi only | 78.10 | 50% |
+ | Sindhi-BERT S2 | Sindhi only | 41.62 | 70% |
+ 
 ## Roadmap
 
+ - [x] Custom Sindhi BPE tokenizer (32K vocab)
+ - [x] Session 1 — 500K lines, 5 epochs, perplexity 78
+ - [x] Session 2 — 1.5M lines, 3 epochs, perplexity 41
+ - [ ] Session 3 — new data + full corpus
+ - [ ] Session 4 — lower LR, fine-tuning
 - [ ] Spell checker fine-tuning
 - [ ] Next word prediction
+ - [ ] Sindhi chatbot