SaulLu committed
Commit 7a12445 · 1 Parent(s): 81c2f45

add Readme

Files changed (1): README.md (+104, -0)
README.md ADDED

---
language:
- bn
thumbnail:
tags:
-
license: apache-2.0
datasets:
- oscar
- wikipedia
metrics:
-
---

# [WIP] Albert Bengali - dev version

## Model description

For the moment, only the tokenizer is available. The tokenizer is based on [SentencePiece](https://github.com/google/sentencepiece) with the Unigram language model segmentation algorithm.

Taking into account certain characteristics of the language, we chose that:

- the tokenizer lowercases all text, because Bengali is a unicameral script (there is no distinction between upper and lower case);
- sentence pieces cannot cross word boundaries, because words in Bengali are separated by whitespace.

## Intended uses & limitations

This tokenizer is adapted to the Bengali language. You can use it to pre-train an Albert model on the Bengali language.

#### How to use

To tokenize:

```python
from transformers import AlbertTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AlbertTokenizer.from_pretrained('SaulLu/albert-bn-dev')

text = "পোকেমন জাপানী ভিডিও গেম কোম্পানি নিনটেন্ডো কর্তৃক প্রকাশিত একটি মিডিয়া ফ্র‍্যাঞ্চাইজি।"

# Encode the text as PyTorch tensors
encoded_input = tokenizer(text, return_tensors='pt')
```
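
To inspect the resulting sentence pieces directly (for example, to check the lowercasing and word-boundary behaviour described above), you can also call `tokenize` and `decode`; a minimal sketch continuing from the snippet above:

```python
# Look at the individual sentence pieces produced for the example sentence.
tokens = tokenizer.tokenize(text)
print(tokens)

# Round-trip the encoded ids back to text (special tokens removed).
decoded = tokenizer.decode(encoded_input['input_ids'][0], skip_special_tokens=True)
print(decoded)
```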

#### Limitations and bias

Provide examples of latent issues and potential remediations.

## Training data

The tokenizer was trained on a random subset of 4M sentences from Bengali OSCAR and Bengali Wikipedia.

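The export of these datasets to the plain-text files referenced in the training configuration below is not included in this repository. A minimal sketch of how the OSCAR part could be dumped to text with the `datasets` library (the `unshuffled_deduplicated_bn` config name is an assumption; the output path matches the one in the configuration below):

```python
from datasets import load_dataset

# Assumption: the Bengali portion of OSCAR as hosted on the Hugging Face Hub.
oscar_bn = load_dataset("oscar", "unshuffled_deduplicated_bn", split="train")

# Write one document per line so SentencePiece can consume it as plain text.
with open("./dataset/oscar_bn/oscar_bn.txt", "w", encoding="utf-8") as f:
    for example in oscar_bn:
        f.write(example["text"].replace("\n", " ") + "\n")
```

The Wikipedia file would be produced analogously from a Bengali Wikipedia dump.
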
## Training procedure

### Tokenizer

The tokenizer was trained with the [SentencePiece](https://github.com/google/sentencepiece) library on 8 x Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz with 16GB RAM and 36GB of swap.

```python
import sentencepiece as spm

# Training configuration for the Unigram SentencePiece model.
config = {
    "input": "./dataset/oscar_bn/oscar_bn.txt,./dataset/wikipedia_bn/wikipedia_bn.txt",
    "input_format": "text",
    "model_type": "unigram",
    "vocab_size": 32000,
    "self_test_sample_size": 0,
    "character_coverage": 0.9995,
    "shuffle_input_sentence": True,
    "seed_sentencepiece_size": 1000000,
    "shrinking_factor": 0.75,
    "num_threads": 8,
    "num_sub_iterations": 2,
    "max_sentencepiece_length": 16,
    "max_sentence_length": 4192,
    "split_by_unicode_script": True,
    "split_by_number": True,
    "split_digits": True,
    "control_symbols": "[MASK]",
    "byte_fallback": False,
    "vocabulary_output_piece_score": True,
    "normalization_rule_name": "nmt_nfkc_cf",
    "add_dummy_prefix": True,
    "remove_extra_whitespaces": True,
    "hard_vocab_limit": True,
    "unk_id": 1,
    "bos_id": 2,
    "eos_id": 3,
    "pad_id": 0,
    "bos_piece": "[CLS]",
    "eos_piece": "[SEP]",
    "train_extremely_large_corpus": True,
    "split_by_whitespace": True,
    "model_prefix": "./tokenizer_bn/data/oscar_wiki_bn_spm_unigram_4000000_2021_04_21_17_06_50/spiece",
    "input_sentence_size": 4000000,
    "user_defined_symbols": "(,),\",-,.,–,£",
}

spm.SentencePieceTrainer.train(**config)
```
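
Once training has finished, the resulting `spiece.model` file can be wrapped in the same `AlbertTokenizer` class used above. A minimal sketch (the local path is derived from the `model_prefix` above; `keep_accents=True` is an assumption made here so that Bengali vowel signs are not stripped during normalization):

```python
from transformers import AlbertTokenizer

# Assumption: path derived from the `model_prefix` in the training configuration above.
vocab_file = "./tokenizer_bn/data/oscar_wiki_bn_spm_unigram_4000000_2021_04_21_17_06_50/spiece.model"

tokenizer = AlbertTokenizer(vocab_file=vocab_file, do_lower_case=True, keep_accents=True)

# Save in the transformers format so it can be pushed to the Hub or reloaded with from_pretrained.
tokenizer.save_pretrained("./albert-bn-dev")
```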

<!-- ## Eval results

### BibTeX entry and citation info

```bibtex
@inproceedings{...,
  year={2020}
}
``` -->