Tanat05 committed · Commit fc816c3 · verified · 1 Parent(s): 7ef0c29

Update README.md

Files changed (1)
  1. README.md +92 -0
README.md CHANGED
@@ -1,3 +1,95 @@
  ---
  license: apache-2.0
+ language:
+ - ko
  ---
+ <div align="center">
+ <h1>Korcen</h1>
+ </div>
+
+ ![131_20220604170616](https://user-images.githubusercontent.com/85154556/171998341-9a7439c8-122f-4a9f-beb6-0e0b3aad05ed.png)
+
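+ korcen-ml is a project that uses deep learning to push accuracy further, addressing the main weakness of the existing keyword-based korcen: it is easy to bypass.
+
+ Only some of the models are public; the model files are available [here](https://github.com/KR-korcen/korcen-ml/tree/main/model).
+
+ If you would like to download more model files or the training data, please get in touch.
+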
+ | Model | Sentences in the data |
+ |------|------|
+ | VDCNN(23.4.30) | 200,000 |
+ | VDCNN_KOGPT2(23.5.28) | 2,000,000 |
+ | VDCNN_LLAMA2(23.9.30) | 5,000,000 |
+ | VDCNN_LLAMA2_V2(24.1.29) | 10,000,000 |
+
+
+ Existing keyword-based library: [py version](https://github.com/KR-korcen/korcen), [ts version](https://github.com/KR-korcen/korcen.ts)
+
+ [Support Discord server](https://discord.gg/wyTU3ZQBPE)
+
+ ## Model validation
+ Each dataset defines profanity differently, so read these scores with some margin of error in mind.
+
+
+ | Model | [korean-malicious-comments-dataset](https://github.com/ZIZUN/korean-malicious-comments-dataset) | [Curse-detection-data](https://github.com/2runo/Curse-detection-data) | [kmhas_korean_hate_speech](https://huggingface.co/datasets/jeanlee/kmhas_korean_hate_speech) | [Korean Extremist Website Womad Hate Speech Data](https://www.kaggle.com/datasets/captainnemo9292/korean-extremist-website-womad-hate-speech-data/data) |
+ |------|------|------|------|------|
+ | [korcen(v0.3.5)](https://github.com/KR-korcen/korcen) | 0.7121 | **0.8415** | 0.6800 | 0.6305 |
+ | VDCNN(23.4.30) | 0.6900 | 0.4885 | | 0.4885 |
+ | VDCNN_KOGPT2(23.6.15) | 0.7545 | 0.7824 | | 0.7055 |
+ | VDCNN_LLAMA2(23.9.30) | 0.7762 | 0.8104 | 0.7296 | Replaced by V2 |
+ | VDCNN_LLAMA2_V2(24.1.29) | **0.8322** | 0.8410 | **0.7837** | **0.7120** |
+ | [badword_check](https://github.com/Nam-SW/badword_check)(23.10.1) | 0.5829 | 0.6761 | | |
+ | [CurseDetector](https://github.com/mangto/CurseDetector)(24.1.10) | 0.5679 | Not tested (takes too long) | | 0.5785 |
+
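The README does not say which metric the table reports or how each model was scored. As a rough sketch only, assuming binary labels and plain accuracy at a 0.5 cutoff (both assumptions, not taken from the commit), a comparison of this kind could be computed as follows. `accuracy_at_threshold` and `score_fn` are hypothetical names; `score_fn` stands in for any function that maps a sentence to a score between 0 and 1.

```python
from typing import Callable, Sequence


def accuracy_at_threshold(
    texts: Sequence[str],
    labels: Sequence[int],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> float:
    """Fraction of sentences whose thresholded score matches the label.

    labels use 1 for abusive and 0 for normal (an assumed convention).
    """
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    if not texts:
        raise ValueError("at least one sentence is required")
    correct = 0
    for text, label in zip(texts, labels):
        predicted = score_fn(text) >= threshold  # True means "abusive"
        if predicted == bool(label):
            correct += 1
    return correct / len(texts)
```

Because datasets label profanity differently, a score computed this way is only comparable across models on the same dataset.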
+ ## Example
+ ```py
+ # Python 3.10, TensorFlow 2.10
+ # model: kogpt2
+ import tensorflow as tf
+ import numpy as np
+ import pickle
+ from tensorflow.keras.preprocessing.sequence import pad_sequences
+
+ maxlen = 1000  # maximum token sequence length
+
+ model_path = 'vdcnn_model.h5'
+ tokenizer_path = "tokenizer.pickle"
+
+ model = tf.keras.models.load_model(model_path)
+ with open(tokenizer_path, "rb") as f:
+     tokenizer = pickle.load(f)  # pickled tokenizer object; its library must be installed
+
+ def preprocess_text(text):
+     text = text.lower()
+
+     return text
+
+ def predict_text(text):
+     sentence = preprocess_text(text)
+     encoded_sentence = tokenizer.encode_plus(sentence,
+                                              max_length=maxlen,
+                                              padding="max_length",
+                                              truncation=True)['input_ids']
+     sentence_seq = pad_sequences([encoded_sentence], maxlen=maxlen, truncating="post")
+     prediction = model.predict(sentence_seq)[0][0]  # single score for one sentence
+     return prediction
+
+ while True:
+     text = input("Enter the sentence you want to test: ")
+     result = predict_text(text)
+     if result >= 0.5:
+         print("This sentence contains abusive language.")
+     else:
+         print("It's a normal sentence.")
+ ```
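For non-interactive use, the same 0.5 cutoff can be applied to a list of sentences. The sketch below is illustrative and not part of the commit: `classify_sentences`, `score_fn`, and `threshold` are hypothetical names, and `score_fn` stands in for the `predict_text` function from the example above.

```python
from typing import Callable, Iterable, List, Tuple


def classify_sentences(
    sentences: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Tuple[str, float, bool]]:
    """Score each sentence and flag those at or above the threshold."""
    results = []
    for sentence in sentences:
        score = float(score_fn(sentence))  # e.g. predict_text(sentence)
        results.append((sentence, score, score >= threshold))
    return results


if __name__ == "__main__":
    # Stand-in scorer so the sketch runs without the model files.
    def fake_score(sentence: str) -> float:
        return 0.9 if "bad" in sentence else 0.1

    for sentence, score, flagged in classify_sentences(["a bad word", "hello"], fake_score):
        label = "abusive" if flagged else "normal"
        print(f"{score:.2f} {label}: {sentence}")
```

Passing the scorer as an argument keeps the thresholding logic separate from model loading, so a stricter or looser cutoff can be tried without touching the model code.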
+
+
+ ## Maker
+
+
+ >Tanat
+ ```
+ github: Tanat05
+ discord: Tanat05
+ email: tanat@tanat.kr
+ ```