---
license: apache-2.0
pipeline_tag: fill-mask
tags:
- fill-mask
- transformers
- en
- ko
---
# mbertV2.0

- bert-base-multilingual-cased with an extended vocab, further trained on the [moco-corpus-kowiki2022 corpus](https://huggingface.co/datasets/bongsoo/moco-corpus-kowiki2022) (3.2M sentences extracted from kowiki202206 + MOCOMSYS)
- **vocab: 152,537 tokens** (32,989 tokens added to the original bert vocab of 119,548)

## Usage (HuggingFace Transformers)

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('bongsoo/mbertV2.0', do_lower_case=False)
model = AutoModel.from_pretrained('bongsoo/mbertV2.0')

# Encode a sentence and inspect the last hidden states
inputs = tokenizer('대한민국의 수도는 서울이다.', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```
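
Since the card tags the model as fill-mask, it can also be loaded with a masked-LM head via `AutoModelForMaskedLM`. The mask-prediction mechanics can be sketched on a tiny randomly initialized BERT, so the example runs without downloading the full 776MB checkpoint; the config values and token ids below are placeholders, not from this model:

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Toy config so the sketch runs offline; with the real model you would use
# AutoModelForMaskedLM.from_pretrained('bongsoo/mbertV2.0') and the tokenizer's ids.
config = BertConfig(vocab_size=1000, hidden_size=64, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=128)
model = BertForMaskedLM(config)
model.eval()

mask_pos = 2
input_ids = torch.tensor([[5, 42, 7, 6]])  # pretend position 2 holds [MASK]
with torch.no_grad():
    logits = model(input_ids).logits        # shape: (1, seq_len, vocab_size)

# Top-5 candidate token ids for the masked position
top5 = logits[0, mask_pos].topk(5).indices.tolist()
print(top5)
```

With the real checkpoint, the same steps (forward pass, then top-k over the logits at the mask position) are what the `fill-mask` pipeline performs internally.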

## Training

**MLM (Masked Language Model) training**
- Input model: bert-base-multilingual-cased
- Corpus: training: bongsoo/moco-corpus-kowiki2022 (7.6M), evaluation: bongsoo/bongevalsmall
- HyperParameters: LearningRate: 5e-5, epochs: 8, batch size: 32, max_token_len: 128
- vocab: 152,537 tokens (32,989 new tokens added to the original 119,548)
- Output model: mbertV2.0 (size: 776MB)
- Training time: 90h on 1 GPU (24GB, 19.6GB used)
- Loss: training loss: 2.258400, eval loss: 3.102096, perplexity: 19.78158 (bong_eval: 1,500)
- Training code: see [here](https://github.com/kobongsoo/BERT/blob/master/bert/bert-MLM-Trainer-V1.2.ipynb)
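
The MLM objective above masks random tokens and trains the model to recover them. A minimal sketch of the standard BERT masking recipe (15% of tokens, of which 80% become `[MASK]`, 10% a random token, 10% unchanged) is shown below; the recipe and the `MASK_ID` placeholder are assumptions, since the card does not spell them out:

```python
import random

MASK_ID = 4          # placeholder id for [MASK]; the real vocab has 152,537 entries
VOCAB_SIZE = 152537  # vocab size reported in this card

def mask_tokens(input_ids, mask_prob=0.15, seed=0):
    """Return (masked_ids, labels) following the standard BERT masking recipe."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in input_ids:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                masked.append(MASK_ID)                    # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                masked.append(tok)                        # 10%: keep unchanged
        else:
            masked.append(tok)
            labels.append(-100)  # -100 is ignored by the cross-entropy loss
    return masked, labels

masked, labels = mask_tokens(list(range(20)))
print(masked)
print(labels)
```

Positions labeled `-100` contribute nothing to the loss, so training signal comes only from the ~15% of selected positions.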

## Citing & Authors

bongsoo