---
language:
- sd
license: mit
tags:
- sindhi
- bert
- masked-language-modeling
- from-scratch
---
# Sindhi-BERT-base
First BERT-style model trained from scratch on Sindhi text.
## Training History
| Session | Data | Epochs | PPL | Notes |
|---|---|---|---|---|
| S1 | 500K lines | 5 | 78.10 | from scratch |
| S2 | 1.5M lines | 3 | 41.62 | continued |
| S3 | 87M words | 2 | 28.46 | bf16, cosine LR |
| S4 | 87M words | 3 | 35.16 | grouped context, MLM=0.20 |
| S5 | 87M words | 2 | 29.67 | fine polish, MLM=0.15 |
| S6r | 149M words | 2 | 31.66 | grouping=80, LR=5e-6 |
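The PPL column is presumably the exponential of the mean cross-entropy loss over masked positions, the usual masked-LM perplexity. A minimal sketch, assuming natural-log losses as reported by PyTorch (`mlm_perplexity` is a hypothetical helper, not part of this repo):

```python
import math

def mlm_perplexity(mean_ce_loss: float) -> float:
    # Perplexity is exp of the mean cross-entropy (in nats),
    # averaged over the masked positions only.
    return math.exp(mean_ce_loss)

# A mean masked-token loss of ~3.35 nats corresponds to PPL of roughly 28.5.
print(mlm_perplexity(3.35))
```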
## Usage
```python
from transformers import RobertaForMaskedLM
from huggingface_hub import hf_hub_download
import sentencepiece as spm

REPO = "hellosindh/sindhi-bert-base"

# Special-token ids used by the SentencePiece tokenizer.
MASK_ID = 32000
BOS_ID = 2
EOS_ID = 3

# Load the model weights and the 32k SentencePiece tokenizer from the Hub.
model = RobertaForMaskedLM.from_pretrained(REPO)
sp_path = hf_hub_download(REPO, "sindhi_bpe_32k.model")
sp = spm.SentencePieceProcessor()
sp.Load(sp_path)
```
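Loading the model and tokenizer is only half the story; to actually fill a mask you need to wrap the SentencePiece ids with the BOS/EOS ids above and put `MASK_ID` at the position to predict. A minimal sketch, assuming the 32k SentencePiece vocabulary maps one-to-one onto the model's embedding ids (`build_input_ids` and `fill_mask` are hypothetical helpers, not part of the repo):

```python
import torch

MASK_ID, BOS_ID, EOS_ID = 32000, 2, 3

def build_input_ids(piece_ids, mask_pos):
    # Replace one piece with <mask> and wrap the sequence with BOS/EOS.
    ids = list(piece_ids)
    ids[mask_pos] = MASK_ID
    return torch.tensor([[BOS_ID] + ids + [EOS_ID]])

def fill_mask(model, sp, text, mask_pos, top_k=5):
    # Return the top_k candidate pieces for the masked position.
    input_ids = build_input_ids(sp.EncodeAsIds(text), mask_pos)
    with torch.no_grad():
        logits = model(input_ids).logits
    # The mask sits at mask_pos + 1 because of the leading BOS token.
    probs = torch.softmax(logits[0, mask_pos + 1], dim=-1)
    top = torch.topk(probs, top_k)
    return [(sp.IdToPiece(i.item()), p.item())
            for i, p in zip(top.indices, top.values)]
```

Call it with any Sindhi sentence and the index of the piece to blank out, e.g. `fill_mask(model, sp, sentence, mask_pos=2)`.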