Commit c03341d (verified) by hellosindh · 1 Parent(s): feaec69

Update README.md

Files changed (1)
  1. README.md +1 -33
README.md CHANGED
@@ -1,18 +1,3 @@
- ---
- language:
- - sd
- license: mit
- tags:
- - sindhi
- - bert
- - masked-language-modeling
- - from-scratch
- ---
-
- # Sindhi-BERT-base
-
- First BERT-style model trained from scratch on Sindhi text.
-
  # Sindhi-BERT-base
 
  The first BERT-style language model trained **from scratch** on Sindhi text, using a custom Sindhi BPE tokenizer with 32,000 pure Sindhi tokens.
@@ -268,21 +253,4 @@ fill_mask('سنڌي [MASK] دنيا جي قديم ٻولين مان ھڪ آھي'
 
  ## About
 
- This model is part of a larger effort to build complete NLP tools for the Sindhi language one of the oldest languages in the world with over 30 million speakers across Pakistan and India.
-
- The corpus was carefully cleaned using a custom pipeline including Unicode normalization, script standardization, he-character normalization (ھ/ه/ہ), and word-level corrections using a 9,355-entry Sindhi dictionary.
-
- ## Usage
-
- ```python
- from transformers import AutoModelForMaskedLM
- import sentencepiece as spm, torch
- import torch.nn.functional as F
- from huggingface_hub import hf_hub_download
-
- model = AutoModelForMaskedLM.from_pretrained("hellosindh/sindhi-bert-base")
- sp_path = hf_hub_download("hellosindh/sindhi-bert-base", "sindhi_bpe_32k.model")
- sp = spm.SentencePieceProcessor()
- sp.Load(sp_path)
- MASK_ID = 32000
- ```
+ The corpus was carefully cleaned using a custom pipeline including Unicode normalization, script standardization, he-character normalization (ھ/ه/ہ), and word-level corrections using a 9,355-entry Sindhi dictionary.
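
The he-character normalization named in the README line above could look roughly like the sketch below. It is not the author's pipeline, which is not shown in this commit. The choice of ھ (U+06BE) as the canonical form, the use of NFC for the Unicode normalization step, and the names `normalize_he`, `CANONICAL_HE` and `HE_VARIANTS` are all assumptions made for illustration.

```python
import unicodedata

# Assumption: ھ (U+06BE) is kept as the canonical form. The README does not
# say which of the three variants the real cleaning pipeline keeps.
CANONICAL_HE = "\u06BE"  # ھ ARABIC LETTER HEH DOACHASHMEE

# Translation table for str.translate: code point -> replacement string.
HE_VARIANTS = {
    ord("\u0647"): CANONICAL_HE,  # ه ARABIC LETTER HEH
    ord("\u06C1"): CANONICAL_HE,  # ہ ARABIC LETTER HEH GOAL
}


def normalize_he(text: str) -> str:
    """Apply Unicode NFC, then fold the he variants to one code point."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(HE_VARIANTS)
```

Folding every variant to one code point this way loses any distinction the variants may carry in running text, so a real pipeline may need context-dependent rules instead.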