Update README.md
#2
by
TakSung
- opened
README.md
CHANGED
|
@@ -2,4 +2,43 @@
|
|
| 2 |
license: mit
|
| 3 |
language:
|
| 4 |
- ko
|
| 5 |
-
---
|
| 2 |
license: mit
|
| 3 |
language:
|
| 4 |
- ko
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
# Kconvo-roberta: Korean conversation RoBERTa ([GitHub](https://github.com/HeoTaksung/Domain-Robust-Retraining-of-Pretrained-Language-Model))
|
| 8 |
+
- Many pretrained language models (PLMs) exist for Korean, but most of them are trained on written language.
|
| 9 |
+
- Here, we introduce a PLM retrained for prediction tasks on Korean conversational data.
|
| 10 |
+
|
| 11 |
+
## Usage
|
| 12 |
+
```python
|
| 13 |
+
# Kconvo-roberta
|
| 14 |
+
from transformers import RobertaTokenizerFast, RobertaModel
|
| 15 |
+
|
| 16 |
+
tokenizer_roberta = RobertaTokenizerFast.from_pretrained("yeongjoon/Kconvo-roberta")
|
| 17 |
+
model_roberta = RobertaModel.from_pretrained("yeongjoon/Kconvo-roberta")
|
| 18 |
+
```
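Once the tokenizer and model are loaded, a common next step is to turn sentences into fixed-size vectors. The sketch below shows one way to do that; it is not something this model card prescribes. It averages the final hidden states over non-padding tokens. The helper names `mean_pool` and `embed` are illustrative, and mean pooling itself is an assumption, since the card does not say which pooling strategy suits this model.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1), cast to the hidden-state dtype
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for an all-padding row
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


def embed(sentences, model_name="yeongjoon/Kconvo-roberta"):
    """Hypothetical helper: return one mean-pooled vector per sentence."""
    # Imported here so mean_pool can be reused without loading transformers
    from transformers import RobertaTokenizerFast, RobertaModel

    tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
    model = RobertaModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state, batch["attention_mask"])
```

Calling `embed(["오늘 뭐 먹을까?", "점심 같이 먹자!"])` downloads the checkpoint on first use and returns a tensor with one row per sentence. For a downstream classifier, fine-tuning the whole model is usually preferable to using frozen pooled vectors.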
|
| 19 |
+
|
| 20 |
+
-----------------
|
| 21 |
+
## Domain Robust Retraining of Pretrained Language Model
|
| 22 |
+
|
| 23 |
+
- Kconvo-roberta uses [klue/roberta-base](https://huggingface.co/klue/roberta-base) as its base model and is additionally retrained on conversation datasets.
|
| 24 |
+
- The retraining data was collected from the [National Institute of the Korean Language](https://corpus.korean.go.kr/request/corpusRegist.do) and [AI-Hub](https://www.aihub.or.kr/aihubdata/data/list.do?pageIndex=1&currMenu=115&topMenu=100&dataSetSn=&srchdataClCode=DATACL001&srchOrder=&SrchdataClCode=DATACL002&searchKeyword=&srchDataRealmCode=REALM002&srchDataTy=DATA003). The corpora used are listed below.
|
| 25 |
+
|
| 26 |
+
```
|
| 27 |
+
- National Institute of the Korean Language
|
| 28 |
+
* 온라인 대화 말뭉치 2021 (Online Conversation Corpus 2021)
|
| 29 |
+
* 일상 대화 말뭉치 2020 (Everyday Conversation Corpus 2020)
|
| 30 |
+
* 구어 말뭉치 (Spoken Language Corpus)
|
| 31 |
+
* 메신저 말뭉치 (Messenger Corpus)
|
| 32 |
+
|
| 33 |
+
- AI-Hub
|
| 34 |
+
* 온라인 구어체 말뭉치 데이터 (Online Colloquial Corpus Data)
|
| 35 |
+
* 상담 음성 (Counseling Speech)
|
| 36 |
+
* 한국어 음성 (Korean Speech)
|
| 37 |
+
* 자유대화 음성(일반남여) (Free Conversation Speech, General Male/Female)
|
| 38 |
+
* 일상생활 및 구어체 한-영 번역 병렬 말뭉치 데이터 (Daily Life and Colloquial Korean-English Translation Parallel Corpus Data)
|
| 39 |
+
* 한국인 대화음성 (Korean Conversational Speech)
|
| 40 |
+
* 감성 대화 말뭉치 (Emotional Conversation Corpus)
|
| 41 |
+
* 주제별 텍스트 일상 대화 데이터 (Topic-based Text Everyday Conversation Data)
|
| 42 |
+
* 용도별 목적대화 데이터 (Purpose-oriented Dialogue Data by Use Case)
|
| 43 |
+
* 한국어 SNS (Korean SNS)
|
| 44 |
+
```
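This card does not state which objective was used to retrain the base model on the corpora above. As a rough illustration only, the sketch below assumes the masked-language-modelling (MLM) objective that RoBERTa-style models are normally pretrained with: about 15% of tokens are selected, and of those 80% become the mask token, 10% become a random token, and 10% stay unchanged. The function name `mask_tokens` and all of its parameters are hypothetical and are not taken from the authors' repository.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=frozenset(),
                mlm_prob=0.15, rng=None):
    """Return (inputs, labels) for one sequence under BERT/RoBERTa-style masking.

    Labels are -100 at positions that are not predicted, which is the value
    PyTorch's cross-entropy loss ignores by default.
    """
    if rng is None:
        rng = random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; otherwise select with probability mlm_prob
        if tok in special_ids or rng.random() >= mlm_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels
```

In practice the same behaviour is available through `DataCollatorForLanguageModeling` in the `transformers` library, which applies the masking afresh to each batch.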
|