# Sindhi-BERT-base

The first BERT-style model trained from scratch on Sindhi text.

## Training History

| Session | Data | Epochs | PPL | Notes |
|---------|------|--------|-----|-------|
| S1 | 500K lines | 5 | 78.10 | from scratch |
| S2 | 1.5M lines | 3 | 41.62 | continued |
| S3 | 87M words | 2 | 28.46 | bf16, cosine LR |
| S4 | 87M words | 3 | 35.16 | grouped context, MLM=0.20 |
| S5 | 87M words | 2 | 29.67 | fine polish, MLM=0.15 |
| S6r | 149M words | 2 | 31.66 | grouping=80, LR=5e-6 |
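The "grouped context" and "grouping=80" settings from S4 onward refer to concatenating short lines and re-chunking them into fixed-length token blocks before masking. The exact training pipeline is not published here; the helper below is a minimal sketch of that idea (the function name `group_texts` and the drop-remainder behavior are assumptions, not the repo's code):

```python
def group_texts(token_id_lists, block_size=80):
    """Concatenate token-id sequences and re-chunk them into fixed-size blocks.

    Trailing tokens that do not fill a whole block are dropped, so every
    training example has exactly `block_size` tokens.
    """
    flat = [tid for ids in token_id_lists for tid in ids]
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


# Three short "lines" of token ids become two full blocks of size 4;
# the ninth token is dropped as an incomplete remainder.
blocks = group_texts([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
```

Grouping like this keeps every batch position filled with real tokens, which is why the later sessions could train on longer effective contexts without padding waste.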

## Usage

```python
from transformers import RobertaForMaskedLM
import sentencepiece as spm
import torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download

REPO    = "hellosindh/sindhi-bert-base"
MASK_ID = 32000  # special-token ids from the SentencePiece vocab
BOS_ID  = 2
EOS_ID  = 3

# Load the model weights and the SentencePiece tokenizer from the Hub
model   = RobertaForMaskedLM.from_pretrained(REPO)
sp_path = hf_hub_download(REPO, "sindhi_bpe_32k.model")
sp      = spm.SentencePieceProcessor()
sp.Load(sp_path)
```
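The snippet above only loads the model and tokenizer. Filling a masked position would use the `MASK_ID`, `BOS_ID`, and `EOS_ID` constants roughly as sketched below; the helper name `build_input_ids` and the example token ids are illustrative, not part of the repo:

```python
import torch
import torch.nn.functional as F

MASK_ID, BOS_ID, EOS_ID = 32000, 2, 3

def build_input_ids(piece_ids, mask_pos):
    """Wrap SentencePiece ids with BOS/EOS and mask one piece position."""
    ids = [BOS_ID] + list(piece_ids) + [EOS_ID]
    ids[1 + mask_pos] = MASK_ID  # +1 accounts for the BOS token
    return torch.tensor([ids])

# Dummy piece ids; real ids would come from sp.EncodeAsIds(sindhi_text)
input_ids = build_input_ids([101, 202, 303], mask_pos=1)

# With the loaded model, prediction for the masked slot would look like:
# logits = model(input_ids).logits[0, 1 + 1]   # mask sits at position 2
# probs  = F.softmax(logits, dim=-1)
# top5   = torch.topk(probs, 5).indices.tolist()
# print([sp.IdToPiece(i) for i in top5])
```

Note that the model-dependent lines are left as comments since they require downloading the checkpoint.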