---
license: apache-2.0
language:
- ko
---
<div align="center">
<h1>Korcen</h1>
</div>

korcen-ml is a project that aims to raise detection accuracy with deep learning, overcoming a weakness of the existing keyword-based korcen: it is easy to bypass.

Only some of the models are published; the model files are available [here](https://github.com/KR-korcen/korcen-ml/tree/main/model).

If you would like to download more model files or the training data, please get in touch.

| Model | Sentences in dataset |
|------|------|
| VDCNN(23.4.30) | 200,000 |
| VDCNN_KOGPT2(23.5.28) | 2,000,000 |
| VDCNN_LLAMA2(23.9.30) | 5,000,000 |
| VDCNN_LLAMA2_V2(24.1.29) | 10,000,000 |

Existing keyword-based libraries: [py version](https://github.com/KR-korcen/korcen), [ts version](https://github.com/KR-korcen/korcen.ts)

[Support Discord server](https://discord.gg/wyTU3ZQBPE)

## Model validation

Each dataset uses different criteria for what counts as profanity, so please keep in mind that the results carry some margin of error.

| Model | [korean-malicious-comments-dataset](https://github.com/ZIZUN/korean-malicious-comments-dataset) | [Curse-detection-data](https://github.com/2runo/Curse-detection-data) | [kmhas_korean_hate_speech](https://huggingface.co/datasets/jeanlee/kmhas_korean_hate_speech) | [Korean Extremist Website Womad Hate Speech Data](https://www.kaggle.com/datasets/captainnemo9292/korean-extremist-website-womad-hate-speech-data/data) |
|------|------|------|------|------|
| [korcen(v0.3.5)](https://github.com/KR-korcen/korcen) | 0.7121 | **0.8415** | 0.6800 | 0.6305 |
| VDCNN(23.4.30) | 0.6900 | 0.4885 | | 0.4885 |
| VDCNN_KOGPT2(23.6.15) | 0.7545 | 0.7824 | | 0.7055 |
| VDCNN_LLAMA2(23.9.30) | 0.7762 | 0.8104 | 0.7296 | Replaced by V2 |
| VDCNN_LLAMA2_V2(24.1.29) | **0.8322** | 0.8410 | **0.7837** | **0.7120** |
| [badword_check](https://github.com/Nam-SW/badword_check)(23.10.1) | 0.5829 | 0.6761 | | |
| [CurseDetector](https://github.com/mangto/CurseDetector)(24.1.10) | 0.5679 | Not tested due to time constraints | | 0.5785 |

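
The table does not say which metric the scores represent. As a rough sketch only, assuming they are plain accuracy at a 0.5 decision threshold, a comparable number for your own labeled data could be computed as below. The function name `accuracy` and the sample scores and labels are hypothetical illustrations, not values from this project.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of items whose thresholded score matches the label.

    scores: model outputs in [0, 1]; labels: 1 for abusive, 0 for normal.
    Assumption: the reported metric is accuracy (not stated in the README).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one example is required")
    correct = sum((s >= threshold) == bool(l) for s, l in zip(scores, labels))
    return correct / len(scores)


# Made-up scores and labels, for illustration only.
example_scores = [0.91, 0.12, 0.55, 0.40]
example_labels = [1, 0, 0, 1]
print(accuracy(example_scores, example_labels))  # 0.5
```

Because each dataset labels profanity differently, a score computed this way is only comparable across models evaluated on the same data.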
## Example
```py
# Python 3.10, TensorFlow 2.10
import pickle

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 1000  # maximum sequence length used for padding and truncation

model_path = 'vdcnn_model.h5'
tokenizer_path = "tokenizer.pickle"

# Load the trained Keras model and the pickled tokenizer.
# Only unpickle files that come from a source you trust.
model = tf.keras.models.load_model(model_path)
with open(tokenizer_path, "rb") as f:
    tokenizer = pickle.load(f)

def preprocess_text(text):
    # Lowercase the input before tokenizing.
    return text.lower()

def predict_text(text):
    sentence = preprocess_text(text)
    # Convert the sentence to token ids, padded or truncated to maxlen.
    encoded_sentence = tokenizer.encode_plus(sentence,
                                             max_length=maxlen,
                                             padding="max_length",
                                             truncation=True)['input_ids']
    sentence_seq = pad_sequences([encoded_sentence], maxlen=maxlen, truncating="post")
    # Take the first output value for the single input sentence.
    prediction = model.predict(sentence_seq)[0][0]
    return prediction

# Interactive loop: read a sentence, score it, and report the result.
while True:
    text = input("Enter the sentence you want to test: ")
    result = predict_text(text)
    if result >= 0.5:
        print("This sentence contains abusive language.")
    else:
        print("It's a normal sentence.")
```

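
The example above scores one sentence at a time in an interactive loop. For scripts, a small wrapper that applies the same 0.5 threshold to a list of sentences may be more convenient. The following is a minimal sketch: `classify_texts` and `fake_predict` are hypothetical names introduced here, and `fake_predict` is only a stand-in so the snippet runs without the model files. In practice you would pass a function like `predict_text` from the example instead.

```python
def classify_texts(texts, predict_fn, threshold=0.5):
    """Return a (text, score, is_abusive) tuple for each input text.

    predict_fn is any callable mapping a string to a score in [0, 1].
    """
    results = []
    for text in texts:
        score = float(predict_fn(text))
        results.append((text, score, score >= threshold))
    return results


def fake_predict(text):
    # Stand-in scorer for illustration only; it is NOT the real model.
    return 0.9 if "bad" in text.lower() else 0.1


for text, score, is_abusive in classify_texts(["a bad word", "hello"], fake_predict):
    label = "abusive" if is_abusive else "normal"
    print(f"{score:.2f}  {label}  {text}")
```

Keeping the threshold as a parameter makes it easy to trade false positives against false negatives for a particular dataset.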
## Maker

>Tanat
```
github: Tanat05
discord: Tanat05
email: tanat@tanat.kr
```