---
language:
- sd
license: mit
tags:
- sindhi
- bert
- masked-language-modeling
- from-scratch
---
# Sindhi-BERT-base

First BERT-style model trained from scratch on Sindhi text.
## Training History

| Session | Data | Epochs | PPL | Notes |
|---|---|---|---|---|
| S1 | 500K lines | 5 | 78.10 | from scratch |
| S2 | 1.5M lines | 3 | 41.62 | continued |
| S3 | 87M words | 2 | 28.46 | bf16, cosine LR |
| S4 | 87M words | 3 | 35.16 | grouped context, MLM=0.20 |
| S5 | 87M words | 2 | 29.67 | fine polish, MLM=0.15 |
| S6r | 149M words | 2 | 31.66 | grouping=80, LR=5e-6 |
## Usage
```python
from transformers import RobertaForMaskedLM
from huggingface_hub import hf_hub_download
import sentencepiece as spm

REPO = "hellosindh/sindhi-bert-base"

# Special token IDs used by this checkpoint
MASK_ID = 32000
BOS_ID = 2
EOS_ID = 3

# Load the model weights and switch to inference mode
model = RobertaForMaskedLM.from_pretrained(REPO)
model.eval()

# Load the 32k-vocab SentencePiece tokenizer shipped in the repo
sp_path = hf_hub_download(REPO, "sindhi_bpe_32k.model")
sp = spm.SentencePieceProcessor()
sp.Load(sp_path)
```
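The snippet above only loads the model and tokenizer; the mask-prediction step itself can be sketched as below. To keep the sketch runnable offline it substitutes a tiny randomly initialized `RobertaForMaskedLM` config for the downloaded checkpoint (an assumption for illustration only); with the real model, use the `model` loaded via `from_pretrained(REPO)` and encode text with `sp.EncodeAsIds(...)` before masking a position.

```python
import torch
import torch.nn.functional as F
from transformers import RobertaConfig, RobertaForMaskedLM

# Token IDs as documented in the card: 32k BPE vocab plus a mask token.
MASK_ID, BOS_ID, EOS_ID = 32000, 2, 3

# Stand-in model so this sketch runs without downloading weights.
# With the real checkpoint: model = RobertaForMaskedLM.from_pretrained(REPO)
config = RobertaConfig(vocab_size=32001, hidden_size=64, num_hidden_layers=2,
                       num_attention_heads=2, intermediate_size=128)
model = RobertaForMaskedLM(config).eval()

def fill_mask(token_ids, top_k=5):
    """Return (token_id, probability) pairs for the masked position."""
    ids = [BOS_ID] + token_ids + [EOS_ID]
    mask_pos = ids.index(MASK_ID)
    input_ids = torch.tensor([ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    probs = F.softmax(logits[0, mask_pos], dim=-1)
    top = torch.topk(probs, top_k)
    return list(zip(top.indices.tolist(), top.values.tolist()))

# Hypothetical token IDs standing in for sp.EncodeAsIds(text) output,
# with one position replaced by MASK_ID.
preds = fill_mask([101, MASK_ID, 202])
```

With the real tokenizer, map the predicted IDs back to text with `sp.IdToPiece(token_id)`.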