hellosindh committed
Commit a8234de · verified · 1 Parent(s): ec371f4

Update README.md

  1. README.md +0 -25
README.md CHANGED
@@ -70,27 +70,6 @@ Tested on 10 Sindhi sentences after 5 epochs of training:
  Overall: 50% top-1 accuracy after 5 epochs on 500K sentences.
  Results improve significantly with more training.
 
- ## Comparison With Other Models
-
- | Model | Type | Perplexity | Fill-mask Quality |
- |---|---|---|---|
- | mBERT fine-tuned | Multilingual | 4.19 | Poor — predicts punctuation |
- | XLM-R fine-tuned | Multilingual | 5.88 | Good — 80% correct |
- | Sindhi-BERT scratch | Sindhi only | 78.10 | 50% — still improving |
-
- Note: Perplexity is not directly comparable between from-scratch and fine-tuned models. SindhiBERT starts from zero knowledge while mBERT/XLM-R start from pre-trained multilingual weights. SindhiBERT predictions are always real Sindhi words — never punctuation.
-
- ## Roadmap
-
- - [x] Train custom Sindhi BPE tokenizer (32K vocab)
- - [x] Session 1 — 500K lines, 5 epochs, A100
- - [ ] Session 2 — full corpus 2.1M lines
- - [ ] Session 3 — more epochs, lower learning rate
- - [ ] Fine-tune for spell checking
- - [ ] Fine-tune for next word prediction
- - [ ] Fine-tune for named entity recognition
- - [ ] Sindhi chatbot
-
  ## Citation
 
  If you use this model please cite:
@@ -99,7 +78,3 @@ sindhibert2026,
  title = Sindhi-BERT: A Sindhi Language Model Trained From Scratch,
  year = 2026,
  url = https://huggingface.co/hellosindh/sindhi-bert-base
-
- ## About
-
- This model is part of a larger effort to build complete NLP tools for the Sindhi language — one of the oldest languages in the world with over 30 million speakers across Pakistan and India.
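
The removed comparison table reports two metrics, perplexity and fill-mask top-1 accuracy, without stating how they were computed. The sketch below shows the conventional definitions only, assuming perplexity is the exponential of the mean per-token negative log-likelihood (natural log). The function names `perplexity` and `top1_accuracy` are hypothetical and are not taken from the repository. Because each model scores tokens from its own vocabulary, values computed this way with different tokenizers are not directly comparable, which fits the caveat in the removed note.

```python
import math


def perplexity(token_nlls):
    """Return exp(mean negative log-likelihood) over the scored tokens.

    `token_nlls` is a sequence of per-token losses in nats
    (hypothetical input; the README does not describe its evaluation setup).
    """
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


def top1_accuracy(predictions, targets):
    """Return the fraction of masked positions where the top prediction matches."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("need at least one target")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)
```

With 10 test sentences, as in the README's evaluation, 5 matching top predictions would give `top1_accuracy` of 0.5.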
 