---
license: apache-2.0
language:
- ko
---
<div align="center">
<h1>Korcen</h1>
</div>
![131_20220604170616](https://user-images.githubusercontent.com/85154556/171998341-9a7439c8-122f-4a9f-beb6-0e0b3aad05ed.png)
korcen-ml is a project that uses deep learning to push accuracy further and overcome the main weakness of the existing keyword-based korcen: it is easy to bypass.

Only some of the models are public. The model files are available [here](https://github.com/KR-korcen/korcen-ml/tree/main/model).

If you would like to download more model files or the training data, please get in touch.

| | Number of sentences in the data |
|------|------|
| VDCNN(23.4.30) | 200,000 |
| VDCNN_KOGPT2(23.5.28) | 2,000,000 |
| VDCNN_LLAMA2(23.9.30) | 5,000,000 |
| VDCNN_LLAMA2_V2(24.1.29) | 10,000,000 |

Existing keyword-based library: [py version](https://github.com/KR-korcen/korcen), [ts version](https://github.com/KR-korcen/korcen.ts)

[Support Discord server](https://discord.gg/wyTU3ZQBPE)

## Model validation
Each dataset uses a different standard for what counts as profanity, so keep in mind that these numbers carry some error.

| | [korean-malicious-comments-dataset](https://github.com/ZIZUN/korean-malicious-comments-dataset) | [Curse-detection-data](https://github.com/2runo/Curse-detection-data) | [kmhas_korean_hate_speech](https://huggingface.co/datasets/jeanlee/kmhas_korean_hate_speech) | [Korean Extremist Website Womad Hate Speech Data](https://www.kaggle.com/datasets/captainnemo9292/korean-extremist-website-womad-hate-speech-data/data) |
|------|------|------|------|------|
| [korcen(v0.3.5)](https://github.com/KR-korcen/korcen) | 0.7121 | **0.8415** | 0.6800 | 0.6305 |
| VDCNN(23.4.30) | 0.6900 | 0.4885 | | 0.4885 |
| VDCNN_KOGPT2(23.6.15) | 0.7545 | 0.7824 | | 0.7055 |
| VDCNN_LLAMA2(23.9.30) | 0.7762 | 0.8104 | 0.7296 | Replaced by V2 |
| VDCNN_LLAMA2_V2(24.1.29) | **0.8322** | 0.8410 | **0.7837** | **0.7120** |
| [badword_check](https://github.com/Nam-SW/badword_check)(23.10.1) | 0.5829 | 0.6761 | | |
| [CurseDetector](https://github.com/mangto/CurseDetector)(24.1.10) | 0.5679 | Not tested (takes too long) | | 0.5785 |
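
The table does not say which metric the scores are, so the following is only a sketch of one plausible way to score a binary classifier against a labeled dataset, assuming plain accuracy at a 0.5 cutoff. `accuracy_at_threshold`, `toy_score`, and `toy_samples` are hypothetical names, and the toy scorer merely stands in for a real model.

```python
from typing import Callable, Iterable, Tuple


def accuracy_at_threshold(
    samples: Iterable[Tuple[str, int]],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> float:
    """Fraction of samples whose thresholded score matches the 0/1 label."""
    total = 0
    correct = 0
    for text, label in samples:
        # Assumption: a score at or above the threshold means "abusive" (label 1).
        predicted = 1 if score_fn(text) >= threshold else 0
        correct += int(predicted == label)
        total += 1
    if total == 0:
        raise ValueError("no samples to evaluate")
    return correct / total


# Hypothetical stand-in for a real model: flags any text containing "bad".
def toy_score(text: str) -> float:
    return 0.9 if "bad" in text.lower() else 0.1


# Hypothetical (text, label) pairs; 1 = abusive, 0 = normal.
toy_samples = [("hello there", 0), ("BAD word", 1), ("so bad", 1), ("fine", 1)]
toy_accuracy = accuracy_at_threshold(toy_samples, toy_score)
print(toy_accuracy)
```

Since each dataset labels profanity by its own standard, the same scorer can land on quite different numbers from one dataset to the next.
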
## Example
```py
# Python 3.10, TensorFlow 2.10
import pickle

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 1000  # maximum sequence length passed to the tokenizer
model_path = 'vdcnn_model.h5'
tokenizer_path = "tokenizer.pickle"

# Load the trained model and the pickled tokenizer
model = tf.keras.models.load_model(model_path)
with open(tokenizer_path, "rb") as f:
    tokenizer = pickle.load(f)


def preprocess_text(text):
    # Lowercase the input before tokenizing
    text = text.lower()
    return text


def predict_text(text):
    sentence = preprocess_text(text)
    # Encode to token ids, padded or truncated to maxlen
    encoded_sentence = tokenizer.encode_plus(sentence,
                                             max_length=maxlen,
                                             padding="max_length",
                                             truncation=True)['input_ids']
    sentence_seq = pad_sequences([encoded_sentence], maxlen=maxlen, truncating="post")
    # Take the single score for the single input sentence
    prediction = model.predict(sentence_seq)[0][0]
    return prediction


# Interactive loop: scores of 0.5 or higher are reported as abusive
while True:
    text = input("Enter the sentence you want to test: ")
    result = predict_text(text)
    if result >= 0.5:
        print("This sentence contains abusive language.")
    else:
        print("It's a normal sentence.")
```
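
For non-interactive use, the same `>= 0.5` rule can be applied to a list of messages. The sketch below is not part of korcen-ml: `split_messages` and `fake_score` are hypothetical names, and the scorer is passed in as a parameter so the snippet runs without TensorFlow. In practice you would pass `predict_text` from the example above.

```python
from typing import Callable, Iterable, List, Tuple


def split_messages(
    messages: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> Tuple[List[str], List[str]]:
    """Return (allowed, flagged) lists using a score >= threshold rule."""
    allowed: List[str] = []
    flagged: List[str] = []
    for message in messages:
        if score_fn(message) >= threshold:
            flagged.append(message)
        else:
            allowed.append(message)
    return allowed, flagged


# Hypothetical stand-in scorer so the sketch runs on its own.
def fake_score(message: str) -> float:
    return 0.8 if "spam" in message else 0.2


allowed, flagged = split_messages(["hi", "spam spam", "ok"], fake_score)
print(allowed, flagged)
```

Keeping the threshold as a parameter makes it easy to trade false positives against false negatives for a given community.
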
## Maker
>Tanat
```
github: Tanat05
discord: Tanat05
email: tanat@tanat.kr
```