---
language:
- sd
license: mit
tags:
- sindhi
- bert
- masked-language-modeling
- from-scratch
---

# Sindhi-BERT-base

First BERT-style model trained from scratch on Sindhi text.

## Training History

| Session | Data | Epochs | PPL | Notes |
|---|---|---|---|---|
| S1 | 500K lines | 5 | 78.10 | from scratch |
| S2 | 1.5M lines | 3 | 41.62 | continued |
| S3 | 87M words | 2 | 28.46 | bf16, cosine LR |
| S4 | 87M words | 3 | 35.16 | grouped context, MLM=0.20 |
| S5 | 87M words | 2 | 29.67 | fine polish, MLM=0.15 |
| S6r | 149M words | 2 | 31.66 | grouping=80, LR=5e-6 |

## Usage

```python
from transformers import RobertaForMaskedLM
import sentencepiece as spm, torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download

REPO = "hellosindh/sindhi-bert-base"
MASK_ID = 32000
BOS_ID = 2
EOS_ID = 3

model = RobertaForMaskedLM.from_pretrained(REPO)
sp_path = hf_hub_download(REPO, "sindhi_bpe_32k.model")
sp = spm.SentencePieceProcessor()
sp.Load(sp_path)
```
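The usage snippet loads the model and tokenizer but stops short of a prediction. Below is a minimal fill-mask sketch of what the next step could look like. So that it runs offline, it builds a tiny randomly initialized RoBERTa (its outputs are meaningless) and uses placeholder token IDs; with the real checkpoint you would instead reuse `model` and `sp` from the snippet above and encode a Sindhi sentence with `sp.EncodeAsIds(...)`. The special-token IDs are taken from this card; everything else is illustrative.

```python
import torch
import torch.nn.functional as F
from transformers import RobertaConfig, RobertaForMaskedLM

MASK_ID, BOS_ID, EOS_ID = 32000, 2, 3  # special-token IDs from this card

# Tiny random model so the sketch runs offline; with the real checkpoint,
# reuse `model` and `sp` from the usage snippet instead.
config = RobertaConfig(
    vocab_size=32001, hidden_size=64, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=128,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config).eval()

# With the real tokenizer: ids = sp.EncodeAsIds("<a Sindhi sentence>")
ids = [101, 202, 303, 404]              # placeholder token IDs
pos = 2                                 # which content token to mask (0-based)
input_ids = [BOS_ID] + ids + [EOS_ID]   # wrap with BOS/EOS as in training
input_ids[pos + 1] = MASK_ID            # +1 accounts for the BOS token

with torch.no_grad():
    logits = model(torch.tensor([input_ids])).logits

# Distribution over the vocabulary at the masked position.
probs = F.softmax(logits[0, pos + 1], dim=-1)
top = torch.topk(probs, k=5)
# With the real tokenizer, decode candidates via sp.IdToPiece(int(i)).
print(top.indices.tolist())
```

With the real model the top-k IDs would be decoded back to Sindhi subword pieces; here they are random, since the weights are untrained.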
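The PPL column in the training-history table is presumably masked-LM perplexity, i.e. the exponential of the mean cross-entropy loss over masked tokens (an assumption; the card does not state the evaluation details). The two quantities convert with `math.exp`/`math.log`:

```python
import math

# PPL = exp(mean masked-token cross-entropy loss), so e.g. S3's
# reported PPL of 28.46 corresponds to a mean loss of about 3.35 nats.
ppl = 28.46
loss = math.log(ppl)
print(round(loss, 2))  # 3.35
```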