---
language:
- sd
license: mit
tags:
- sindhi
- bert
- masked-language-modeling
- from-scratch
---
# Sindhi-BERT-base

First BERT-style model trained from scratch on Sindhi text.
## Training History

| Session | Data | Epochs | PPL | Notes |
|---|---|---|---|---|
| S1 | 500K lines | 5 | 78.10 | from scratch |
| S2 | 1.5M lines | 3 | 41.62 | continued |
| S3 | 87M words | 2 | 28.46 | bf16, cosine LR |
| S4 | 87M words | 3 | 35.16 | grouped context, MLM=0.20 |
| S5 | 87M words | 2 | 29.67 | fine polish, MLM=0.15 |
| S6r | 149M words | 2 | 31.66 | grouping=80, LR=5e-6 |
## Usage
```python
from transformers import RobertaForMaskedLM
from huggingface_hub import hf_hub_download
import sentencepiece as spm

REPO = "hellosindh/sindhi-bert-base"

# Special token IDs used by this checkpoint
MASK_ID = 32000
BOS_ID = 2
EOS_ID = 3

# Load the model weights and switch to inference mode
model = RobertaForMaskedLM.from_pretrained(REPO)
model.eval()

# Load the 32k-vocab SentencePiece tokenizer shipped in the repo
sp_path = hf_hub_download(REPO, "sindhi_bpe_32k.model")
sp = spm.SentencePieceProcessor()
sp.Load(sp_path)
```
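The snippet above only loads the model and tokenizer; the mask-prediction step itself can be sketched as below. To keep the sketch runnable offline it substitutes a tiny randomly initialized `RobertaForMaskedLM` config for the downloaded checkpoint (an assumption for illustration only); with the real model, use the `model` loaded via `from_pretrained(REPO)` and encode text with `sp.EncodeAsIds(...)` before masking a position.

```python
import torch
import torch.nn.functional as F
from transformers import RobertaConfig, RobertaForMaskedLM

# Token IDs as documented in the card: 32k BPE vocab plus a mask token.
MASK_ID, BOS_ID, EOS_ID = 32000, 2, 3

# Stand-in model so this sketch runs without downloading weights.
# With the real checkpoint: model = RobertaForMaskedLM.from_pretrained(REPO)
config = RobertaConfig(vocab_size=32001, hidden_size=64, num_hidden_layers=2,
                       num_attention_heads=2, intermediate_size=128)
model = RobertaForMaskedLM(config).eval()

def fill_mask(token_ids, top_k=5):
    """Return (token_id, probability) pairs for the masked position."""
    ids = [BOS_ID] + token_ids + [EOS_ID]
    mask_pos = ids.index(MASK_ID)
    input_ids = torch.tensor([ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    probs = F.softmax(logits[0, mask_pos], dim=-1)
    top = torch.topk(probs, top_k)
    return list(zip(top.indices.tolist(), top.values.tolist()))

# Hypothetical token IDs standing in for sp.EncodeAsIds(text) output,
# with one position replaced by MASK_ID.
preds = fill_mask([101, MASK_ID, 202])
```

With the real tokenizer, map the predicted IDs back to text with `sp.IdToPiece(token_id)`.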