|
|
--- |
|
|
license: mit |
|
|
language: |
|
|
- ko |
|
|
--- |
|
|
|
|
|
# Kconvo-roberta: Korean conversation RoBERTa ([github](https://github.com/HeoTaksung/Domain-Robust-Retraining-of-Pretrained-Language-Model)) |
|
|
- There are many PLMs (Pretrained Language Models) for Korean, but most of them are trained on written language.
|
|
- Here, we introduce a retrained PLM for Korean conversation tasks, where spoken-language data is used for the retraining.
|
|
|
|
|
## Usage |
|
|
```python |
|
|
# Kconvo-roberta |
|
|
from transformers import RobertaTokenizerFast, RobertaModel |
|
|
|
|
|
tokenizer_roberta = RobertaTokenizerFast.from_pretrained("yeongjoon/Kconvo-roberta") |
|
|
model_roberta = RobertaModel.from_pretrained("yeongjoon/Kconvo-roberta") |
|
|
``` |
|
|
|
|
|
----------------- |
|
|
## Domain Robust Retraining of Pretrained Language Model |
|
|
|
|
|
- Kconvo-roberta uses [klue/roberta-base](https://huggingface.co/klue/roberta-base) as its base model and was additionally retrained on a conversation dataset.
|
|
- The retraining dataset was collected through the [National Institute of the Korean Language](https://corpus.korean.go.kr/request/corpusRegist.do) and [AI-Hub](https://www.aihub.or.kr/aihubdata/data/list.do?pageIndex=1&currMenu=115&topMenu=100&dataSetSn=&srchdataClCode=DATACL001&srchOrder=&SrchdataClCode=DATACL002&searchKeyword=&srchDataRealmCode=REALM002&srchDataTy=DATA003), and the collected datasets are as follows.
|
|
|
|
|
``` |
|
|
- National Institute of the Korean Language |
|
|
    * 온라인 대화 말뭉치 2021 (Online Conversation Corpus 2021)


    * 일상 대화 말뭉치 2020 (Everyday Conversation Corpus 2020)


    * 구어 말뭉치 (Spoken Corpus)


    * 메신저 말뭉치 (Messenger Corpus)




- AI-Hub


    * 온라인 구어체 말뭉치 데이터 (Online Colloquial Corpus Data)


    * 상담 음성 (Counseling Speech)


    * 한국어 음성 (Korean Speech)


    * 자유대화 음성(일반남여) (Free Conversation Speech, Ordinary Men and Women)


    * 일상생활 및 구어체 한-영 번역 병렬 말뭉치 데이터 (Daily Life and Colloquial Korean-English Parallel Translation Corpus Data)


    * 한국인 대화음성 (Korean Conversation Speech)


    * 감성 대화 말뭉치 (Emotional Dialogue Corpus)


    * 주제별 텍스트 일상 대화 데이터 (Topic-Specific Text Daily Conversation Data)


    * 용도별 목적대화 데이터 (Purpose-Specific Goal-Oriented Dialogue Data)


    * 한국어 SNS (Korean SNS)
|
|
``` |
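The retraining described above is standard domain-adaptive masked-language-model pretraining on top of klue/roberta-base. Below is a hedged sketch of one step of that objective using the Hugging Face data collator; the toy sentences and the `mlm_probability` value are assumptions for illustration, not the authors' actual corpus or settings:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")
model = AutoModelForMaskedLM.from_pretrained("klue/roberta-base")

# Toy stand-in for the conversational corpora listed above.
texts = [
    "안녕하세요, 오늘 뭐 하셨어요?",
    "점심에 친구랑 만나서 밥 먹었어요.",
    "주말에 같이 영화 보러 갈래?",
]

# Randomly mask 15% of tokens (a common default; the authors' value may differ).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
batch = collator([tokenizer(t, truncation=True, max_length=64) for t in texts])

# The MLM loss on the masked positions is what retraining minimizes; a full
# run would wrap this forward pass in a Trainer/optimizer loop over the corpus.
loss = model(**batch).loss
```

After enough steps over the conversational corpus, the adapted weights are what get published as `yeongjoon/Kconvo-roberta`.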
|
|
|