- BanglaLM dataset
---

# Bangla BERT Base

We have published a pretrained Bangla BERT language model, **bangla-bert**, which is now available on the Hugging Face model hub.
[bert-base-bangla](https://github.com/Kowsher/bert-base-bangla) is a pretrained Bangla language model trained with the masked language modeling objective described in [BERT](https://arxiv.org/abs/1810.04805) and the accompanying GitHub [repository](https://github.com/google-research/bert).

## Corpus Details

We trained bangla-bert on the BanglaLM dataset from Kaggle: [BanglaLM](https://www.kaggle.com/gakowsher/bangla-language-model-dataset). The dataset comes in 3 versions, totaling almost 40 GB.
After downloading the dataset, we proceeded to masked language model pretraining.
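As a rough illustration of the masked-LM data preparation step, here is a generic BERT-style masking sketch (an assumption for illustration, not the exact script used for bangla-bert): each token is selected with 15% probability, and a selected token becomes `[MASK]` 80% of the time, a random token 10% of the time, and stays unchanged 10% of the time.

```py
import random

def mask_tokens(tokens, mask_token="[MASK]", vocab=None, mlm_prob=0.15, seed=0):
    """BERT-style masking: select ~15% of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Positions with a non-None label are the prediction targets."""
    rng = random.Random(seed)
    vocab = vocab if vocab is not None else list(tokens)
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mlm_prob:
            labels[i] = tok  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                out[i] = mask_token
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the original token
    return out, labels

masked, labels = mask_tokens(["আমি", "বাংলার", "গান", "গাই"])
```

The real pretraining pipeline works on wordpiece IDs rather than strings, but the selection logic is the same.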

**bangla-bert Tokenizer**

```py
from transformers import AutoTokenizer, AutoModel

bnbert_tokenizer = AutoTokenizer.from_pretrained("Kowsher/bangla-bert")
text = "খাঁটি সোনার চাইতে খাঁটি আমার দেশের মাটি"
bnbert_tokenizer.tokenize(text)
# output: ['খাটি', 'সে', '##ানার', 'চাইতে', 'খাটি', 'আমার', 'দেশের', 'মাটি']
```
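In the output above, `##` marks a wordpiece that continues the previous token. A small helper (hypothetical, not part of the transformers library) can merge the pieces back into whole words:

```py
def merge_wordpieces(pieces):
    """Join WordPiece tokens: a '##'-prefixed piece extends the previous word."""
    words = []
    for p in pieces:
        if p.startswith("##") and words:
            words[-1] += p[2:]  # strip the '##' marker and append
        else:
            words.append(p)
    return words

pieces = ['খাটি', 'সে', '##ানার', 'চাইতে', 'খাটি', 'আমার', 'দেশের', 'মাটি']
merge_wordpieces(pieces)  # 'সে' and '##ানার' merge into one word, giving 7 words
```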

Here, we can use the bangla-bert model for masked language modeling:

```py
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("Kowsher/bangla-bert")
tokenizer = BertTokenizer.from_pretrained("Kowsher/bangla-bert")

nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
for pred in nlp(f"আমি বাংলার গান {nlp.tokenizer.mask_token}"):
    print(pred)
```
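Each item yielded by the fill-mask pipeline is a dict with `score`, `token_str`, and `sequence` keys. A sketch of selecting the most likely completion, using made-up predictions (not real model output) in that shape:

```py
# Hypothetical predictions, shaped like fill-mask pipeline output.
preds = [
    {"score": 0.42, "token_str": "গাই", "sequence": "আমি বাংলার গান গাই"},
    {"score": 0.07, "token_str": "শুনি", "sequence": "আমি বাংলার গান শুনি"},
]
best = max(preds, key=lambda p: p["score"])
print(best["sequence"])  # the highest-scoring filled-in sentence
```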