Kowsher committed
Commit 01ab153 · 1 Parent(s): 26d4fc1

Update README.md

Files changed (1): README.md (+5 −5)
README.md CHANGED
@@ -11,18 +11,18 @@ datasets:
 - BanglaLM dataset
 ---
 # Bangla BERT Base
-Here we publish a pretrained Bangla BERT language model as **bert-base-bangla**, which is now available on the Hugging Face model hub.
+Here we publish a pretrained Bangla BERT language model as **bangla-bert**, which is now available on the Hugging Face model hub.
 Here we describe [bert-base-bangla](https://github.com/Kowsher/bert-base-bangla), a pretrained Bangla language model based on the masked language modeling approach described in [BERT](https://arxiv.org/abs/1810.04805) and the accompanying GitHub [repository](https://github.com/google-research/bert).
 ## Corpus Details
 We trained the Bangla BERT language model using the [BanglaLM](https://www.kaggle.com/gakowsher/bangla-language-model-dataset) dataset from Kaggle. There are 3 versions of the dataset, totaling almost 40GB.
 After downloading the dataset, we proceeded with masked language modeling.
 
 
-**Bangla Base BERT Tokenizer**
+**bangla-bert Tokenizer**
 
 ```py
 from transformers import AutoTokenizer, AutoModel
-bnbert_tokenizer = AutoTokenizer.from_pretrained("Kowsher/bert-base-bangla")
+bnbert_tokenizer = AutoTokenizer.from_pretrained("Kowsher/bangla-bert")
 text = "খাঁটি সোনার চাইতে খাঁটি আমার দেশের মাটি"
 bnbert_tokenizer.tokenize(text)
 # output: ['খাটি', 'সে', '##ানার', 'চাইতে', 'খাটি', 'আমার', 'দেশের', 'মাটি']
@@ -31,8 +31,8 @@ bnbert_tokenizer.tokenize(text)
 Here, we can use the Bangla BERT base model for masked language modeling:
 ```py
 from transformers import BertForMaskedLM, BertTokenizer, pipeline
-model = BertForMaskedLM.from_pretrained("Kowsher/bert-base-bangla")
-tokenizer = BertTokenizer.from_pretrained("Kowsher/bert-base-bangla")
+model = BertForMaskedLM.from_pretrained("Kowsher/bangla-bert")
+tokenizer = BertTokenizer.from_pretrained("Kowsher/bangla-bert")
 
 nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
 for pred in nlp(f"আমি বাংলার গান {nlp.tokenizer.mask_token}"):
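
The `##` prefix in the tokenizer output shown in the diff marks WordPiece continuation pieces, which attach to the preceding token. A minimal sketch (plain Python, no model download needed) of how such pieces can be rejoined into surface words:

```python
# Rejoin WordPiece tokens: a piece starting with "##" continues the previous word.
def merge_wordpieces(tokens):
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip the "##" marker and append to the last word
        else:
            words.append(tok)
    return words

# Tokenizer output from the README example above
pieces = ['খাটি', 'সে', '##ানার', 'চাইতে', 'খাটি', 'আমার', 'দেশের', 'মাটি']
print(merge_wordpieces(pieces))
# → ['খাটি', 'সোনার', 'চাইতে', 'খাটি', 'আমার', 'দেশের', 'মাটি']
```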
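
The `fill-mask` pipeline used above returns a list of dicts with `sequence`, `score`, `token`, and `token_str` keys (standard `transformers` behavior). Since the diff hunk ends before the loop body, here is a sketch of consuming those predictions; the prediction values below are mock/illustrative, not actual model output, so the snippet runs without downloading the model:

```python
# Mock predictions in the shape returned by transformers' pipeline('fill-mask').
# The sequences, scores, and token ids here are illustrative, not real model output.
mock_preds = [
    {"sequence": "আমি বাংলার গান গাই", "score": 0.85, "token": 123, "token_str": "গাই"},
    {"sequence": "আমি বাংলার গান শুনি", "score": 0.10, "token": 456, "token_str": "শুনি"},
]

def top_prediction(preds):
    """Return the token_str of the highest-scoring prediction."""
    best = max(preds, key=lambda p: p["score"])
    return best["token_str"]

for pred in mock_preds:  # mirrors the truncated loop in the README
    print(f"{pred['token_str']}: {pred['score']:.2f}")

print(top_prediction(mock_preds))
```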