# Bad_text_classifier

## Model Introduction
This repository releases a model that classifies whether the comments and chat messages found across the internet contain sensitive (offensive) content or not. The model was fine-tuned on public datasets whose labels were revised and which were then merged together. Please understand that the model cannot always judge every sentence correctly.

```
NOTE)
Due to copyright restrictions on the public datasets, the transformed data used to train this model cannot be released.
Also, please note in advance that the model's judgments do not reflect the author's own opinions.
```

## Dataset
### data label
* **0 : bad sentence**
* **1 : not bad sentence**

### datasets used
* [smilegate-ai/Korean Unsmile Dataset](https://github.com/smilegate-ai/korean_unsmile_dataset)
* [kocohub/Korean HateSpeech Dataset](https://github.com/kocohub/korean-hate-speech)

### dataset processing
Both datasets, which were not originally binary-labeled, were first relabeled into binary form; then only the label-1 (not bad sentence) rows of the Korean HateSpeech Dataset were extracted and merged into the processed Korean Unsmile Dataset.
</br>

**Some entries of the Korean Unsmile Dataset that had been labeled clean were revised to 0 (bad sentence):**
* Sentences containing "~노" together with "이기" or "노무" were revised to 0 (bad sentence)
* Sentences containing sex-related slang were revised to 0 (bad sentence)
</br></br>
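The relabel-and-merge step described above can be sketched roughly as follows. This is a toy illustration, not the project's actual preprocessing code; the row structure and example texts are assumptions.

```python
# Toy sketch of the dataset merge described above (assumed row structure,
# not the project's real preprocessing code).
# Label convention from this README: 0 = bad sentence, 1 = not bad sentence.

unsmile = [  # Korean Unsmile Dataset after binary relabeling
    {"text": "a friendly comment", "label": 1},
    {"text": "a hateful comment", "label": 0},
]
hatespeech = [  # Korean HateSpeech Dataset after binary relabeling
    {"text": "a neutral comment", "label": 1},
    {"text": "a toxic comment", "label": 0},
]

# Keep only the label-1 (not bad sentence) rows of HateSpeech, then merge.
merged = unsmile + [row for row in hatespeech if row["label"] == 1]
print(len(merged))  # 3 rows: both Unsmile rows plus the one clean HateSpeech row
```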

## Model Training
* Fine-tuning was performed using ElectraForSequenceClassification from huggingface transformers.
* Three publicly released Korean ELECTRA models were each fine-tuned.

### used models
* [Beomi/KcELECTRA](https://github.com/Beomi/KcELECTRA)
* [monologg/koELECTRA](https://github.com/monologg/KoELECTRA)
* [tunib/electra-ko-base](https://huggingface.co/tunib/electra-ko-base)

### how to train?
```BASH
python codes/model_source/train_torch_sch.py \
    --learning_rate=3e-06 \
    --use_float_16=True \
    --weight_decay=0.001 \
    --base_ckpt_save_path=BASE_SAVE_CKPT_PATH \
    --epochs=10 \
    --batch_size=128 \
    --model_type=MODEL_TYPE
```
### parameters
| parameter | type | description | default |
| ---------- | ---------- | ---------- | --------- |
| learning_rate | float | learning rate used for training | 5e-05 |
| use_float_16 | bool | whether to train with float16 precision | False |
| weight_decay | float | weight-decay lambda | None |
| base_ckpt_save_path | str | base path where trained checkpoints are saved | None |
| epochs | int | total number of training epochs | 5 |
| batch_size | int | batch size used during training | 64 |
| model_type | int | selects which ELECTRA model to train | 0 |
```
NOTE) The train and valid datasets can be set in the config section inside train_torch_sch.py.
```
</br>
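As a rough illustration of how the flags above could be wired up, here is a hypothetical argparse sketch mirroring the parameter table; it is not the actual contents of train_torch_sch.py. Note that a boolean passed as `--use_float_16=True` arrives as a string and needs explicit conversion.

```python
import argparse

# Hypothetical argparse setup matching the parameter table above;
# the real train_torch_sch.py may define these differently.
parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=5e-05)
# argparse does not parse "True"/"False" into bools by itself, so convert explicitly.
parser.add_argument("--use_float_16", type=lambda s: s.lower() == "true", default=False)
parser.add_argument("--weight_decay", type=float, default=None)
parser.add_argument("--base_ckpt_save_path", type=str, default=None)
parser.add_argument("--epochs", type=int, default=5)
parser.add_argument("--batch_size", type=int, default=64)
parser.add_argument("--model_type", type=int, default=0)

# Simulate the command shown in "how to train?" (partially).
args = parser.parse_args(["--learning_rate=3e-06", "--use_float_16=True", "--batch_size=128"])
print(args.learning_rate, args.use_float_16, args.batch_size)  # 3e-06 True 128
```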

## How to use model?
```PYTHON
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('JminJ/kcElectra_base_Bad_Sentence_Classifier')
tokenizer = AutoTokenizer.from_pretrained('JminJ/kcElectra_base_Bad_Sentence_Classifier')
```
</br>
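Once the model and tokenizer are loaded, the two-class logits map back onto the labels defined in the Dataset section. Below is a minimal plain-Python sketch of that interpretation step; the commented model call is illustrative, and the label mapping simply follows this README's convention.

```python
import math

# Label mapping from the Dataset section: 0 = bad sentence, 1 = not bad sentence.
ID2LABEL = {0: "bad sentence", 1: "not bad sentence"}

def interpret_logits(logits):
    """Turn the model's two-class logits into (label, probability) via softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[idx], probs[idx]

# With the real model, the logits would come from something like:
#   inputs = tokenizer("a test sentence", return_tensors="pt")
#   logits = model(**inputs).logits[0].tolist()
label, prob = interpret_logits([-1.2, 2.3])
print(label)  # not bad sentence
```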

## Predict model
You can run a prediction on any sentence you want to test.
```BASH
python codes/model_source/utils/predict.py \
    --input_text=INPUT_TEXT \
    --base_ckpt=BASE_CKPT
```
### parameters
| parameter | type | description | default |
| ---------- | ---------- | ---------- | --------- |
| input_text | str | user input text | "반갑습니다. JminJ입니다!" |
| base_ckpt | str | base path of the saved trained checkpoints | False |
</br>

## Model Valid Accuracy
| model | accuracy |
| ---------- | ---------- |
| kcElectra_base_fp16_wd_custom_dataset | 0.8849 |
| tunibElectra_base_fp16_wd_custom_dataset | 0.8726 |
| koElectra_base_fp16_wd_custom_dataset | 0.8434 |
```
Note)
All models were trained with the same seed, learning_rate (3e-06), weight_decay lambda (0.001), and batch_size (128).
```
</br>

## Contact
* jminju254@gmail.com
</br></br>

## Github
* https://github.com/JminJ/Bad_text_classifier
</br></br>

## Reference
* [Beomi/KcELECTRA](https://github.com/Beomi/KcELECTRA)
* [monologg/koELECTRA](https://github.com/monologg/KoELECTRA)
* [tunib/electra-ko-base](https://huggingface.co/tunib/electra-ko-base)
* [smilegate-ai/Korean Unsmile Dataset](https://github.com/smilegate-ai/korean_unsmile_dataset)
* [kocohub/Korean HateSpeech Dataset](https://github.com/kocohub/korean-hate-speech)
* [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555)