Tested on 10 Sindhi sentences after 5 epochs of training:
Overall: 50% top-1 accuracy after 5 epochs on 500K sentences. Results improve significantly with more training.
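The top-1 number above is just exact-match scoring over the masked positions. A minimal sketch of that metric (the token strings below are illustrative placeholders, not the actual test sentences):

```python
def top1_accuracy(predictions, gold):
    """Fraction of masked positions where the model's top-ranked
    prediction exactly matches the reference token."""
    if not gold:
        return 0.0
    hits = sum(p == g for p, g in zip(predictions, gold))
    return hits / len(gold)

# 5 correct fills out of 10 masked sentences -> 0.5 (50% top-1)
preds = ["water", "city", "book", "man", "house", "sky", "day", "sun", "tea", "road"]
gold  = ["water", "city", "book", "man", "house", "rain", "night", "moon", "milk", "path"]
print(top1_accuracy(preds, gold))  # 0.5
```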
## Comparison With Other Models

| Model | Type | Perplexity | Fill-mask Quality |
|---|---|---|---|
| mBERT fine-tuned | Multilingual | 4.19 | Poor — predicts punctuation |
| XLM-R fine-tuned | Multilingual | 5.88 | Good — 80% correct |
| Sindhi-BERT scratch | Sindhi only | 78.10 | 50% — still improving |

Note: Perplexity is not directly comparable between from-scratch and fine-tuned models. Sindhi-BERT starts from zero knowledge, while mBERT/XLM-R start from pre-trained multilingual weights. Sindhi-BERT predictions are always real Sindhi words — never punctuation.
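One way to read the perplexity column: perplexity is the exponential of the mean per-token cross-entropy loss, so a from-scratch model naturally starts much higher. A quick sketch of the conversion (the loss values are back-derived from the table for illustration, not logged from training):

```python
import math

def perplexity(avg_cross_entropy_loss):
    """Perplexity is e raised to the mean per-token cross-entropy (in nats)."""
    return math.exp(avg_cross_entropy_loss)

# Losses implied by the table's perplexities:
print(round(math.log(4.19), 3))    # mBERT fine-tuned: loss ~ 1.433
print(round(math.log(78.10), 3))   # Sindhi-BERT from scratch: loss ~ 4.358
print(round(perplexity(4.358), 1)) # back to ~78.1
```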
## Roadmap

- [x] Train custom Sindhi BPE tokenizer (32K vocab)
- [x] Session 1 — 500K lines, 5 epochs, A100
- [ ] Session 2 — full corpus (2.1M lines)
- [ ] Session 3 — more epochs, lower learning rate
- [ ] Fine-tune for spell checking
- [ ] Fine-tune for next-word prediction
- [ ] Fine-tune for named entity recognition
- [ ] Sindhi chatbot
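The first roadmap item refers to byte-pair encoding. As a rough illustration of the merge step at BPE's core (a toy sketch on an English corpus, not the actual 32K-vocab training code, which would typically use the Hugging Face `tokenizers` library):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency
corpus = {tuple("lower"): 5, tuple("low"): 7, tuple("newest"): 3}
for _ in range(3):  # three merge rounds; real training runs until vocab_size
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(corpus)  # frequent substrings like "low" become single tokens
```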
## Citation

If you use this model, please cite:

```bibtex
@misc{sindhibert2026,
  title = {Sindhi-BERT: A Sindhi Language Model Trained From Scratch},
  year = {2026},
  url = {https://huggingface.co/hellosindh/sindhi-bert-base}
}
```
## About

This model is part of a larger effort to build complete NLP tools for the Sindhi language — one of the oldest languages in the world, with over 30 million speakers across Pakistan and India.