SaulLu committed
Commit 7a12445 · 1 Parent(s): 81c2f45

add Readme

Files changed (1): README.md (+104, -0)
README.md ADDED

---
language:
- bn
thumbnail:
tags:
-
license: apache-2.0
datasets:
- oscar
- wikipedia
metrics:
-
---

# [WIP] Albert Bengali - dev version

## Model description

For the moment, only the tokenizer is available. The tokenizer is based on [SentencePiece](https://github.com/google/sentencepiece) with the Unigram language model segmentation algorithm.

Taking into account certain characteristics of the language, we chose that:

- the tokenizer lowercases all text, because Bengali is a unicameral script (there is no distinction between upper and lower case);
- sentence pieces cannot cross word boundaries, because words in Bengali are separated by whitespace.

## Intended uses & limitations

This tokenizer is adapted to the Bengali language. You can use it to pre-train an Albert model on the Bengali language.

#### How to use

To tokenize:

```python
from transformers import AlbertTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AlbertTokenizer.from_pretrained('SaulLu/albert-bn-dev')

text = "পোকেমন জাপানী ভিডিও গেম কোম্পানি নিনটেন্ডো কর্তৃক প্রকাশিত একটি মিডিয়া ফ্র‍্যাঞ্চাইজি।"

# Encode the text as PyTorch tensors
encoded_input = tokenizer(text, return_tensors='pt')
```
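
To inspect the resulting sentence pieces directly (for example, to check the lowercasing and word-boundary behaviour described above), you can also call `tokenize` and `decode`; a minimal sketch continuing from the snippet above:

```python
# Look at the individual sentence pieces produced for the example sentence.
tokens = tokenizer.tokenize(text)
print(tokens)

# Round-trip the encoded ids back to text (special tokens removed).
decoded = tokenizer.decode(encoded_input['input_ids'][0], skip_special_tokens=True)
print(decoded)
```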

#### Limitations and bias

Provide examples of latent issues and potential remediations.

## Training data

The tokenizer was trained on a random subset of 4M sentences from Bengali OSCAR and Bengali Wikipedia.

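The export of these datasets to the plain-text files referenced in the training configuration below is not included in this repository. A minimal sketch of how the OSCAR part could be dumped to text with the `datasets` library (the `unshuffled_deduplicated_bn` config name is an assumption; the output path matches the one in the configuration below):

```python
from datasets import load_dataset

# Assumption: the Bengali portion of OSCAR as hosted on the Hugging Face Hub.
oscar_bn = load_dataset("oscar", "unshuffled_deduplicated_bn", split="train")

# Write one document per line so SentencePiece can consume it as plain text.
with open("./dataset/oscar_bn/oscar_bn.txt", "w", encoding="utf-8") as f:
    for example in oscar_bn:
        f.write(example["text"].replace("\n", " ") + "\n")
```

The Wikipedia file would be produced analogously from a Bengali Wikipedia dump.
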
## Training procedure

### Tokenizer

The tokenizer was trained with the [SentencePiece](https://github.com/google/sentencepiece) library on 8 x Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz with 16GB RAM and 36GB of swap.

```python
import sentencepiece as spm

# Training configuration for the Unigram SentencePiece model.
config = {
    "input": "./dataset/oscar_bn/oscar_bn.txt,./dataset/wikipedia_bn/wikipedia_bn.txt",
    "input_format": "text",
    "model_type": "unigram",
    "vocab_size": 32000,
    "self_test_sample_size": 0,
    "character_coverage": 0.9995,
    "shuffle_input_sentence": True,
    "seed_sentencepiece_size": 1000000,
    "shrinking_factor": 0.75,
    "num_threads": 8,
    "num_sub_iterations": 2,
    "max_sentencepiece_length": 16,
    "max_sentence_length": 4192,
    "split_by_unicode_script": True,
    "split_by_number": True,
    "split_digits": True,
    "control_symbols": "[MASK]",
    "byte_fallback": False,
    "vocabulary_output_piece_score": True,
    "normalization_rule_name": "nmt_nfkc_cf",
    "add_dummy_prefix": True,
    "remove_extra_whitespaces": True,
    "hard_vocab_limit": True,
    "unk_id": 1,
    "bos_id": 2,
    "eos_id": 3,
    "pad_id": 0,
    "bos_piece": "[CLS]",
    "eos_piece": "[SEP]",
    "train_extremely_large_corpus": True,
    "split_by_whitespace": True,
    "model_prefix": "./tokenizer_bn/data/oscar_wiki_bn_spm_unigram_4000000_2021_04_21_17_06_50/spiece",
    "input_sentence_size": 4000000,
    "user_defined_symbols": "(,),\",-,.,–,£",
}

spm.SentencePieceTrainer.train(**config)
```
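
Once training has finished, the resulting `spiece.model` file can be wrapped in the same `AlbertTokenizer` class used above. A minimal sketch (the local path is derived from the `model_prefix` above; `keep_accents=True` is an assumption made here so that Bengali vowel signs are not stripped during normalization):

```python
from transformers import AlbertTokenizer

# Assumption: path derived from the `model_prefix` in the training configuration above.
vocab_file = "./tokenizer_bn/data/oscar_wiki_bn_spm_unigram_4000000_2021_04_21_17_06_50/spiece.model"

tokenizer = AlbertTokenizer(vocab_file=vocab_file, do_lower_case=True, keep_accents=True)

# Save in the transformers format so it can be pushed to the Hub or reloaded with from_pretrained.
tokenizer.save_pretrained("./albert-bn-dev")
```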

<!-- ## Eval results

### BibTeX entry and citation info

```bibtex
@inproceedings{...,
  year={2020}
}
``` -->