| # Bad_text_classifier | |
| ## Model Introduction | |
| We are releasing a model that classifies whether comments and chat messages found across the internet contain sensitive content. The model was fine-tuned on a dataset built from public datasets, with some labels corrected and the datasets merged. Please understand that the model cannot always judge every sentence correctly. | |
| ``` | |
| NOTE) | |
| Because of copyright restrictions on the public datasets, the modified data used to train the model cannot be released. | |
| Also note in advance that the model's outputs do not reflect my own opinions. | |
| ``` | |
| ## Dataset | |
| ### data label | |
| * **0 : bad sentence** | |
| * **1 : not bad sentence** | |
| ### Datasets used | |
| * [smilegate-ai/Korean Unsmile Dataset](https://github.com/smilegate-ai/korean_unsmile_dataset) | |
| * [kocohub/Korean HateSpeech Dataset](https://github.com/kocohub/korean-hate-speech) | |
| ### Dataset processing | |
| Neither dataset was originally a binary classification dataset, so both were first relabeled into binary form. Then only the label 1 (not bad sentence) examples from the Korean HateSpeech Dataset were selected and merged into the processed Korean Unsmile Dataset. | |
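The relabel-and-merge step described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual preprocessing script: the record layouts (a `clean` flag on Unsmile rows, a `hate` field whose value `"none"` marks non-hateful HateSpeech rows) and all helper names are assumptions made for the example.

```python
# Hypothetical record layouts, assumed for illustration only:
#   Unsmile row:    {"text": str, "clean": 0 or 1}
#   HateSpeech row: {"text": str, "hate": "hate" | "offensive" | "none"}

BAD, NOT_BAD = 0, 1  # label convention from the "data label" section


def unsmile_to_binary(row):
    """Map an Unsmile row to a binary label: clean -> 1, anything else -> 0."""
    label = NOT_BAD if row["clean"] == 1 else BAD
    return {"text": row["text"], "label": label}


def hatespeech_to_binary(row):
    """Map a HateSpeech row to a binary label: "none" -> 1, anything else -> 0."""
    label = NOT_BAD if row["hate"] == "none" else BAD
    return {"text": row["text"], "label": label}


def build_dataset(unsmile_rows, hatespeech_rows):
    """Keep every relabeled Unsmile row, plus only the label-1 HateSpeech rows."""
    unsmile = [unsmile_to_binary(r) for r in unsmile_rows]
    hatespeech = [hatespeech_to_binary(r) for r in hatespeech_rows]
    hatespeech_not_bad = [r for r in hatespeech if r["label"] == NOT_BAD]
    return unsmile + hatespeech_not_bad
```

Under these assumptions, all Unsmile rows survive with a binary label, while HateSpeech rows contribute only "not bad" examples.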
| </br> | |
| **Some examples labeled clean in the Korean Unsmile Dataset were changed to 0 (bad sentence).** | |
| * Among sentences containing "~노", those that also contain "이기" or "노무" were changed to 0 (bad sentence) | |
| * Examples containing terms with sexual connotations, such as "좆" and "봊", were changed to 0 (bad sentence) | |
| </br> | |
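The two correction rules above could be expressed as a small function like the one below. It is only a sketch under stated assumptions: plain substring matching stands in for whatever matching the author used, "~노" is treated as "contains 노" rather than as a sentence ending, the term list covers only the two examples named in the text (which says "such as"), and the function name `relabel` is hypothetical.

```python
BAD, NOT_BAD = 0, 1  # label convention from the "data label" section

NO_MARKER = "노"                # stands in for the "~노" pattern
NO_TRIGGERS = ("이기", "노무")   # co-occurring strings named in the first rule
SEXUAL_TERMS = ("좆", "봊")      # only the examples listed; the real list may be longer


def relabel(text, label):
    """Return the corrected label for a row that was originally labeled clean."""
    if label != NOT_BAD:
        return label  # rows already marked bad are left untouched
    # Rule 1: contains "노" together with one of the trigger strings
    if NO_MARKER in text and any(t in text for t in NO_TRIGGERS):
        return BAD
    # Rule 2: contains one of the listed sexual terms
    if any(t in text for t in SEXUAL_TERMS):
        return BAD
    return label
```

Substring matching will produce false positives on ordinary words, so a real pipeline would probably need manual review on top of rules like these.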
| ## Model Training | |
| * Fine-tuning was performed with ElectraForSequenceClassification from huggingface transformers. | |
| * Three publicly available Korean Electra models were each fine-tuned separately. | |
| ### Models used | |
| * [Beomi/KcELECTRA](https://github.com/Beomi/KcELECTRA) | |
| * [monologg/koELECTRA](https://github.com/monologg/KoELECTRA) | |
| * [tunib/electra-ko-base](https://huggingface.co/tunib/electra-ko-base) | |
| ## How to use the model? | |
| ```python | |
| from transformers import AutoModelForSequenceClassification, AutoTokenizer | |
| # Load the fine-tuned classifier weights from the Hugging Face Hub | |
| model = AutoModelForSequenceClassification.from_pretrained('JminJ/koElectra_base_Bad_Sentence_Classifier') | |
| # Load the matching tokenizer | |
| tokenizer = AutoTokenizer.from_pretrained('JminJ/koElectra_base_Bad_Sentence_Classifier') | |
| ``` | |
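Once the model and tokenizer are loaded, running a prediction might look like the sketch below. It assumes the checkpoint outputs two logits whose indices follow the label convention in the "data label" section (0 = bad sentence, 1 = not bad sentence); the helper names `softmax`, `logits_to_prediction`, and `classify` are made up for this example and are not part of the repository.

```python
import math

MODEL_NAME = "JminJ/koElectra_base_Bad_Sentence_Classifier"
# Assumed to match the label convention in the "data label" section
ID2LABEL = {0: "bad sentence", 1: "not bad sentence"}


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_prediction(logits):
    """Turn one row of logits into a (label name, probability) pair."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[idx], probs[idx]


def classify(texts):
    """Classify a list of sentences; downloads the model on first use."""
    # Imported lazily so the helpers above work without torch/transformers
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return [logits_to_prediction(row) for row in logits.tolist()]
```

Calling `classify(["..."])` with a list of sentences returns one `(label, probability)` pair per input. It requires `torch` and `transformers` to be installed and network access for the first download.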
| ## Model Validation Accuracy | |
| model | accuracy | |
| | ---------- | ---------- | | |
| | kcElectra_base_fp16_wd_custom_dataset | 0.8849 | | |
| | tunibElectra_base_fp16_wd_custom_dataset | 0.8726 | | |
| | koElectra_base_fp16_wd_custom_dataset | 0.8434 | | |
| ``` | |
| Note) | |
| All models were trained with the same seed, learning_rate (3e-06), weight_decay lambda (0.001), and batch_size (128). | |
| ``` | |
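For reference, the shared hyperparameters from the note above can be collected in one config object, as in this sketch. Only the learning rate, weight decay, and batch size come from this document. The `fp16` flag is inferred from "fp16" in the model names, the seed value is a placeholder because the actual seed is not stated, and mapping onto `transformers.TrainingArguments` keyword names is an assumption, since the document does not say which training loop was used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    learning_rate: float = 3e-06  # stated in the note above
    weight_decay: float = 0.001   # stated in the note above
    batch_size: int = 128         # stated in the note above
    fp16: bool = True             # assumption: inferred from "fp16" in the model names
    seed: int = 42                # hypothetical placeholder; the real seed is not given

    def to_trainer_kwargs(self):
        """Map the fields onto transformers.TrainingArguments keyword names."""
        return {
            "learning_rate": self.learning_rate,
            "weight_decay": self.weight_decay,
            # assumption: 128 is treated here as the per-device batch size
            "per_device_train_batch_size": self.batch_size,
            "fp16": self.fp16,
            "seed": self.seed,
        }
```

Keeping one frozen config and reusing it for all three base models mirrors the note's point that the runs differ only in the pretrained checkpoint.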
| ## Contact | |
| * jminju254@gmail.com | |
| </br></br> | |
| ## Github | |
| * https://github.com/JminJ/Bad_text_classifier | |
| </br></br> | |
| ## Reference | |
| * [Beomi/KcELECTRA](https://github.com/Beomi/KcELECTRA) | |
| * [monologg/koELECTRA](https://github.com/monologg/KoELECTRA) | |
| * [tunib/electra-ko-base](https://huggingface.co/tunib/electra-ko-base) | |
| * [smilegate-ai/Korean Unsmile Dataset](https://github.com/smilegate-ai/korean_unsmile_dataset) | |
| * [kocohub/Korean HateSpeech Dataset](https://github.com/kocohub/korean-hate-speech) | |
| * [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555) | |