yeongjoon/Kconvo-roberta

Update README.md #2

by TakSung, opened Mar 17, 2023

base: refs/heads/main ← from: refs/pr/2

Files changed (1)

README.md +40 -1

@@ -2,4 +2,43 @@
 license: mit
 language:
 - ko
----
+---
+# Kconvo-roberta: Korean conversation RoBERTa ([GitHub](https://github.com/HeoTaksung/Domain-Robust-Retraining-of-Pretrained-Language-Model))
+- Many pretrained language models (PLMs) exist for Korean, but most target written language.
+- Here, we introduce a PLM retrained for prediction tasks on Korean conversation data.
+## Usage
+```python
+# Kconvo-roberta
+from transformers import RobertaTokenizerFast, RobertaModel
+tokenizer_roberta = RobertaTokenizerFast.from_pretrained("yeongjoon/Kconvo-roberta")
+model_roberta = RobertaModel.from_pretrained("yeongjoon/Kconvo-roberta")
+```
+-----------------
+## Domain Robust Retraining of Pretrained Language Model
+- Kconvo-roberta uses [klue/roberta-base](https://huggingface.co/klue/roberta-base) as its base model and retrains it further on a conversation dataset.
+- The retraining dataset was collected from the [National Institute of the Korean Language](https://corpus.korean.go.kr/request/corpusRegist.do) and [AI-Hub](https://www.aihub.or.kr/aihubdata/data/list.do?pageIndex=1&currMenu=115&topMenu=100&dataSetSn=&srchdataClCode=DATACL001&srchOrder=&SrchdataClCode=DATACL002&searchKeyword=&srchDataRealmCode=REALM002&srchDataTy=DATA003) and consists of the following corpora.
+```
+- National Institute of the Korean Language
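+   * 온라인 대화 말뭉치 2021 (Online Conversation Corpus 2021)
+   * 일상 대화 말뭉치 2020 (Everyday Conversation Corpus 2020)
+   * 구어 말뭉치 (Spoken Language Corpus)
+   * 메신저 말뭉치 (Messenger Corpus)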
+- AI-Hub
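+   * 온라인 구어체 말뭉치 데이터 (Online Colloquial Corpus Data)
+   * 상담 음성 (Counseling Speech)
+   * 한국어 음성 (Korean Speech)
+   * 자유대화 음성(일반남여) (Free Conversation Speech, General Men and Women)
+   * 일상생활 및 구어체 한-영 번역 병렬 말뭉치 데이터 (Daily Life and Colloquial Korean-English Translation Parallel Corpus Data)
+   * 한국인 대화음성 (Korean Conversational Speech)
+   * 감성 대화 말뭉치 (Emotional Conversation Corpus)
+   * 주제별 텍스트 일상 대화 데이터 (Topic-Based Text Daily Conversation Data)
+   * 용도별 목적대화 데이터 (Purpose-Specific Goal-Oriented Dialogue Data)
+   * 한국어 SNS (Korean SNS)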
+```
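
The Usage snippet in this README stops at loading the tokenizer and model. As a possible follow-up, here is a minimal sketch of turning conversation utterances into fixed-size vectors. Masked mean pooling is an assumption chosen for illustration, not something the README prescribes, and `mean_pool` and `embed` are hypothetical helper names.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token vectors of each sentence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # guard against empty rows
    return summed / counts


def embed(sentences, model_name="yeongjoon/Kconvo-roberta"):
    """Hypothetical helper: load the checkpoint and return one vector per sentence."""
    # Imported lazily so mean_pool stays usable without transformers installed.
    from transformers import RobertaModel, RobertaTokenizerFast

    tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
    model = RobertaModel.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state, batch["attention_mask"])
```

Calling `embed(["안녕하세요", "오늘 뭐 해?"])` would download the checkpoint on first use and return a tensor with one row per input sentence. Pooling with the attention mask keeps padded positions from diluting the average when sentences in a batch differ in length.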