Update README.md
README.md
CHANGED
@@ -1,18 +1,3 @@
----
-language:
-- sd
-license: mit
-tags:
-- sindhi
-- bert
-- masked-language-modeling
-- from-scratch
----
-
-# Sindhi-BERT-base
-
-First BERT-style model trained from scratch on Sindhi text.
-
 # Sindhi-BERT-base
 
 The first BERT-style language model trained **from scratch** on Sindhi text, using a custom Sindhi BPE tokenizer with 32,000 pure Sindhi tokens.
@@ -268,21 +253,4 @@ fill_mask('سنڌي [MASK] دنيا جي قديم ٻولين مان ھڪ آھي'
 
 ## About
 
-
-
-The corpus was carefully cleaned using a custom pipeline including Unicode normalization, script standardization, he-character normalization (ھ/ه/ہ), and word-level corrections using a 9,355-entry Sindhi dictionary.
-
-## Usage
-
-```python
-from transformers import AutoModelForMaskedLM
-import sentencepiece as spm, torch
-import torch.nn.functional as F
-from huggingface_hub import hf_hub_download
-
-model = AutoModelForMaskedLM.from_pretrained("hellosindh/sindhi-bert-base")
-sp_path = hf_hub_download("hellosindh/sindhi-bert-base", "sindhi_bpe_32k.model")
-sp = spm.SentencePieceProcessor()
-sp.Load(sp_path)
-MASK_ID = 32000
-```
+The corpus was carefully cleaned using a custom pipeline including Unicode normalization, script standardization, he-character normalization (ھ/ه/ہ), and word-level corrections using a 9,355-entry Sindhi dictionary.
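The cleaning steps named in the About text (Unicode normalization, he-character normalization, dictionary-based word corrections) could look roughly like the sketch below. This is not the project's actual pipeline, which the README does not show. The function name `normalize_sindhi`, the choice of U+06BE (ھ) as the canonical he form, and the `corrections` dictionary are all assumptions made for illustration. The real pipeline may pick a different canonical character or apply context-dependent rules.

```python
import unicodedata

# ASSUMPTION: U+06BE (ھ, HEH DOACHASHMEE) is used as the canonical form here.
# The README only says the three he variants are normalized, not which one wins.
HE_VARIANTS = {
    "\u0647": "\u06BE",  # ه ARABIC LETTER HEH
    "\u06C1": "\u06BE",  # ہ ARABIC LETTER HEH GOAL
}
_HE_TABLE = str.maketrans(HE_VARIANTS)


def normalize_sindhi(text, corrections=None):
    """Hypothetical cleaning helper: NFC, he-variant folding, word fixes."""
    # Step 1: Unicode normalization (composed form).
    text = unicodedata.normalize("NFC", text)
    # Step 2: fold the he variants onto the assumed canonical character.
    text = text.translate(_HE_TABLE)
    # Step 3: word-level corrections from a lookup table (a stand-in for the
    # 9,355-entry dictionary mentioned in the README, which is not included).
    if corrections:
        text = " ".join(corrections.get(word, word) for word in text.split())
    return text


if __name__ == "__main__":
    # Illustrative input containing all three he variants.
    print(normalize_sindhi("\u0647 \u06C1 \u06BE"))
```

Note that the word-level step in this sketch splits on whitespace and rejoins with single spaces, so it collapses runs of whitespace. A production pipeline would likely need a tokenizer that preserves punctuation and spacing.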