# Sindhi-BERT-base

The first BERT-style model trained from scratch on Sindhi text.

## Training History

| Session | Data | Epochs | PPL | Notes |
|---------|------|--------|-----|-------|
| S1 | 500K lines | 5 | 78.10 | from scratch |
| S2 | 1.5M lines | 3 | 41.62 | continued |
| S3 | 87M words | 2 | 28.46 | bf16, cosine LR |
| S4 | 87M words | 3 | 35.16 | grouped context, MLM=0.20 |
| S5 | 87M words | 2 | 29.67 | fine polish, MLM=0.15 |
| S6r | 149M words | 2 | 31.66 | grouping=80, LR=5e-6 |
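The "grouped context" and "grouping=80" settings from S4 onward refer to concatenating short lines and re-chunking them into fixed-length token blocks before masking. The exact training pipeline is not published here; the helper below is a minimal sketch of that idea (the function name `group_texts` and the drop-remainder behavior are assumptions, not the repo's code):

```python
def group_texts(token_id_lists, block_size=80):
    """Concatenate token-id sequences and re-chunk them into fixed-size blocks.

    Trailing tokens that do not fill a whole block are dropped, so every
    training example has exactly `block_size` tokens.
    """
    flat = [tid for ids in token_id_lists for tid in ids]
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


# Three short "lines" of token ids become two full blocks of size 4;
# the ninth token is dropped as an incomplete remainder.
blocks = group_texts([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
```

Grouping like this keeps every batch position filled with real tokens, which is why the later sessions could train on longer effective contexts without padding waste.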

## Usage

```python
from transformers import RobertaForMaskedLM
import sentencepiece as spm
import torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download

REPO    = "hellosindh/sindhi-bert-base"
MASK_ID = 32000  # special-token ids from the SentencePiece vocab
BOS_ID  = 2
EOS_ID  = 3

# Load the model weights and the SentencePiece tokenizer from the Hub
model   = RobertaForMaskedLM.from_pretrained(REPO)
sp_path = hf_hub_download(REPO, "sindhi_bpe_32k.model")
sp      = spm.SentencePieceProcessor()
sp.Load(sp_path)
```
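The snippet above only loads the model and tokenizer. Filling a masked position would use the `MASK_ID`, `BOS_ID`, and `EOS_ID` constants roughly as sketched below; the helper name `build_input_ids` and the example token ids are illustrative, not part of the repo:

```python
import torch
import torch.nn.functional as F

MASK_ID, BOS_ID, EOS_ID = 32000, 2, 3

def build_input_ids(piece_ids, mask_pos):
    """Wrap SentencePiece ids with BOS/EOS and mask one piece position."""
    ids = [BOS_ID] + list(piece_ids) + [EOS_ID]
    ids[1 + mask_pos] = MASK_ID  # +1 accounts for the BOS token
    return torch.tensor([ids])

# Dummy piece ids; real ids would come from sp.EncodeAsIds(sindhi_text)
input_ids = build_input_ids([101, 202, 303], mask_pos=1)

# With the loaded model, prediction for the masked slot would look like:
# logits = model(input_ids).logits[0, 1 + 1]   # mask sits at position 2
# probs  = F.softmax(logits, dim=-1)
# top5   = torch.topk(probs, 5).indices.tolist()
# print([sp.IdToPiece(i) for i in top5])
```

Note that the model-dependent lines are left as comments since they require downloading the checkpoint.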