---
license: apache-2.0
pipeline_tag: fill-mask
tags:
- fill-mask
- transformers
- en
- ko
---
# mdistilbertV2.0

- A model trained from bert-base-multilingual-cased by adding vocab and continuing pretraining on the [moco-corpus-kowiki2022 corpus](https://huggingface.co/datasets/bongsoo/moco-corpus-kowiki2022) (kowiki202206 + 3.2M sentences extracted from MOCOMSYS)
- **vocab: 152,537 entries** (32,989 entries added to the original bert vocab of 119,548)

## Usage (HuggingFace Transformers)

```python
from transformers import AutoTokenizer, AutoModel
import torch

# do_lower_case=False keeps the cased multilingual vocab intact
tokenizer = AutoTokenizer.from_pretrained('bongsoo/mbertV2.0', do_lower_case=False)
model = AutoModel.from_pretrained('bongsoo/mbertV2.0')
```
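Since the card's pipeline tag is `fill-mask`, the checkpoint can also be queried directly for masked-token predictions. A minimal sketch using the standard Transformers `pipeline` API (the example sentence is illustrative and not from this card):

```python
from transformers import pipeline

# Fill-mask pipeline over the same checkpoint as above
fill = pipeline('fill-mask', model='bongsoo/mbertV2.0')

# BERT-style tokenizers use the literal [MASK] token in the input text
results = fill('대한민국의 수도는 [MASK] 입니다.')
for r in results:
    print(r['token_str'], round(r['score'], 4))
```

Each result is a dict with the predicted token (`token_str`), its probability (`score`), and the filled sentence (`sequence`).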

## Training

**MLM (Masked Language Model) training**
- Input model: bert-base-multilingual-cased
- Corpus: training: bongsoo/moco-corpus-kowiki2022 (7.6M sentences); evaluation: bongsoo/bongevalsmall
- Hyperparameters: learning rate: 5e-5, epochs: 8, batch size: 32, max_token_len: 128
- vocab: 152,537 entries (32,989 new entries added to the original 119,548)
- Output model: mbertV2.0 (size: 776MB)
- Training time: 90h on 1 GPU (24GB, 19.6GB used)
- loss: training loss: 2.258400, evaluation loss: 3.102096, perplexity: 19.78158 (bong_eval: 1,500)
- See [here](https://github.com/kobongsoo/BERT/blob/master/bert/bert-MLM-Trainer-V1.2.ipynb) for the training code
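The MLM objective above corrupts input tokens before asking the model to predict them. As a sketch of BERT-style masking (not this card's training code, which lives in the linked notebook): roughly 15% of positions become prediction targets; of those, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The token id 103 below is `[MASK]` in the base BERT vocab and the batch is toy data, both assumptions for illustration:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style MLM corruption. Returns (corrupted_ids, labels) where
    labels is -100 (ignored by cross-entropy) at non-target positions."""
    corrupted = input_ids.clone()
    labels = input_ids.clone()

    # choose ~15% of positions as prediction targets
    target = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~target] = -100

    # 80% of targets -> [MASK]
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & target
    corrupted[masked] = mask_token_id

    # half of the remaining targets (10% overall) -> a random token
    rand = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & target & ~masked
    corrupted[rand] = torch.randint(vocab_size, labels.shape)[rand]
    # the last 10% of targets keep their original token
    return corrupted, labels

torch.manual_seed(0)
ids = torch.randint(5, 1000, (2, 128))  # toy batch, max_token_len=128 as above
corrupted, labels = mask_tokens(ids, mask_token_id=103, vocab_size=152537)
print((labels != -100).float().mean().item())  # fraction of targets, around 15%
```

The `-100` label convention matches what `torch.nn.CrossEntropyLoss` ignores by default, so only the corrupted target positions contribute to the training loss.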

## Citing & Authors

bongsoo