---
language:
- sd
license: mit
tags:
- sindhi
- bert
- masked-language-modeling
- from-scratch
---

# Sindhi-BERT-base

The first BERT-style masked language model trained from scratch on Sindhi text.

## Training History

| Session | Data | Epochs | PPL | Notes |
|---|---|---|---|---|
| S1  | 500K lines  | 5 | 78.10 | from scratch |
| S2  | 1.5M lines  | 3 | 41.62 | continued |
| S3  | 87M words   | 2 | 28.46 | bf16, cosine LR |
| S4  | 87M words   | 3 | 35.16 | grouped context, MLM=0.20 |
| S5  | 87M words   | 2 | 29.67 | fine polish, MLM=0.15 |
| S6r | 149M words  | 2 | 31.66 | grouping=80, LR=5e-6 |
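
The PPL column is standard masked-LM perplexity: the exponential of the mean cross-entropy loss (in nats) over masked positions. A minimal sketch of the conversion, with the loss value back-derived from the table rather than taken from any training log:

```python
import math

def ppl_from_loss(mean_mlm_loss: float) -> float:
    """Perplexity is exp of the mean cross-entropy (nats) over masked tokens."""
    return math.exp(mean_mlm_loss)

def loss_from_ppl(ppl: float) -> float:
    """Inverse: recover the mean masked-LM loss implied by a perplexity."""
    return math.log(ppl)

# Session S3's PPL of 28.46 implies a mean MLM loss of ~3.35 nats.
print(round(loss_from_ppl(28.46), 2))  # 3.35
```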

## Usage

```python
from transformers import RobertaForMaskedLM
import sentencepiece as spm, torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download

REPO    = "hellosindh/sindhi-bert-base"
MASK_ID = 32000  # special token IDs used during training
BOS_ID  = 2
EOS_ID  = 3

model   = RobertaForMaskedLM.from_pretrained(REPO)
model.eval()
sp_path = hf_hub_download(REPO, "sindhi_bpe_32k.model")
sp      = spm.SentencePieceProcessor()
sp.Load(sp_path)

# Mask one token of a sentence and print the top-5 predictions.
ids = sp.EncodeAsIds("سنڌي ٻولي")  # any Sindhi text; this phrase is just an example
ids[-1] = MASK_ID                  # mask the last token
input_ids = torch.tensor([[BOS_ID] + ids + [EOS_ID]])

with torch.no_grad():
    logits = model(input_ids).logits

mask_pos = (input_ids[0] == MASK_ID).nonzero().item()
probs = F.softmax(logits[0, mask_pos], dim=-1)
for p, i in zip(*torch.topk(probs, 5)):
    print(f"{sp.IdToPiece(int(i))}\t{p:.3f}")
```