
---
license: apache-2.0
language:
- ko
---
<div align="center">
<h1>Korcen</h1>
</div>

![131_20220604170616](https://user-images.githubusercontent.com/85154556/171998341-9a7439c8-122f-4a9f-beb6-0e0b3aad05ed.png)

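korcen-ml is a project that uses deep learning to push accuracy further, addressing the main weakness of the original keyword-based korcen: it is easy to bypass.

Only some of the models are public. The model files are available [here](https://github.com/KR-korcen/korcen-ml/tree/main/model).

If you would like to download more model files or the training data, please get in touch.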
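To make the bypass problem concrete, here is a minimal sketch of a toy keyword filter. It is an illustration only and does not reflect how korcen itself is implemented; the `BLOCKLIST` contents and the `keyword_filter` name are hypothetical.

```python
# Toy substring-based filter (hypothetical; not korcen's actual logic).
BLOCKLIST = {"badword"}


def keyword_filter(text):
    """Return True if any blocklisted keyword appears verbatim in the text."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)


print(keyword_filter("you BadWord"))        # caught: exact keyword after lowercasing
print(keyword_filter("you b a d w o r d"))  # missed: spaces break the substring match
print(keyword_filter("you bad.word"))       # missed: one inserted character is enough
```

Each evasion needs its own hand-written rule in a filter like this, which is the gap a learned classifier is meant to narrow.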

| Model | Sentences in training data |
|------|------|
| VDCNN(23.4.30) | 200,000 |
| VDCNN_KOGPT2(23.5.28) | 2,000,000 |
| VDCNN_LLAMA2(23.9.30) | 5,000,000 |
| VDCNN_LLAMA2_V2(24.1.29) | 10,000,000 |


Original keyword-based library: [py version](https://github.com/KR-korcen/korcen), [ts version](https://github.com/KR-korcen/korcen.ts)

[Support Discord server](https://discord.gg/wyTU3ZQBPE)

## Model validation
Each dataset defines profanity differently, so keep in mind that the scores below carry some error.


| Model | [korean-malicious-comments-dataset](https://github.com/ZIZUN/korean-malicious-comments-dataset) | [Curse-detection-data](https://github.com/2runo/Curse-detection-data) | [kmhas_korean_hate_speech](https://huggingface.co/datasets/jeanlee/kmhas_korean_hate_speech) | [Korean Extremist Website Womad Hate Speech Data](https://www.kaggle.com/datasets/captainnemo9292/korean-extremist-website-womad-hate-speech-data/data) |
|------|------|------|------|------|
| [korcen(v0.3.5)](https://github.com/KR-korcen/korcen) | 0.7121 | 0.8415 | 0.6800 | 0.6305 |
| VDCNN(23.4.30) | 0.6900 | 0.4885 | | 0.4885 |
| VDCNN_KOGPT2(23.6.15) | 0.7545 | 0.7824 | | 0.7055 |
| VDCNN_LLAMA2(23.9.30) | 0.7762 | 0.8104 | 0.7296 | Replaced by V2 |
| VDCNN_LLAMA2_V2(24.1.29) | 0.8322 | 0.8410 | 0.7837 | 0.7120 |
| [badword_check](https://github.com/Nam-SW/badword_check)(23.10.1) | 0.5829 | 0.6761 | | |
| [CurseDetector](https://github.com/mangto/CurseDetector)(24.1.10) | 0.5679 | Not tested (takes too long) | | 0.5785 |
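The README does not say which metric these scores use. As a rough sketch, assuming they are plain accuracy with model scores thresholded at 0.5 (an assumption, not something stated here), a comparable number could be computed as follows. The `accuracy` helper and the sample scores and labels are hypothetical.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of examples whose thresholded score matches the 0/1 label.

    Assumes label 1 means abusive and 0 means normal; the datasets above
    may encode their labels differently.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("cannot score an empty dataset")
    correct = sum(int(s >= threshold) == y for s, y in zip(scores, labels))
    return correct / len(scores)


# Made-up scores and labels, for illustration only.
example_scores = [0.91, 0.12, 0.55, 0.40]
example_labels = [1, 0, 0, 1]
print(accuracy(example_scores, example_labels))  # 2 of the 4 predictions match
```

Because each dataset draws the line for profanity differently, the same model can score quite differently across columns even under one metric.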

## Example
```py
# py: 3.10, tf: 2.10
import pickle

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 1000  # fixed input length, in tokens

model_path = "vdcnn_model.h5"
tokenizer_path = "tokenizer.pickle"

model = tf.keras.models.load_model(model_path)
# Only unpickle files from a source you trust: pickle can execute arbitrary code.
with open(tokenizer_path, "rb") as f:
    tokenizer = pickle.load(f)


def preprocess_text(text):
    """Lowercase the input before tokenization."""
    return text.lower()


def predict_text(text):
    """Return the model's score for `text`."""
    sentence = preprocess_text(text)
    # Tokenize, then pad or truncate to exactly `maxlen` token IDs.
    encoded_sentence = tokenizer.encode_plus(
        sentence,
        max_length=maxlen,
        padding="max_length",
        truncation=True,
    )["input_ids"]
    sentence_seq = pad_sequences([encoded_sentence], maxlen=maxlen, truncating="post")
    prediction = model.predict(sentence_seq)[0][0]
    return prediction


# Interactive loop; press Ctrl+C to exit.
while True:
    text = input("Enter the sentence you want to test: ")
    result = predict_text(text)
    # Scores of 0.5 or higher are treated as abusive.
    if result >= 0.5:
        print("This sentence contains abusive language.")
    else:
        print("It's a normal sentence.")
```
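For non-interactive use, the thresholding step can be separated from the model call. The sketch below is one possible arrangement, not part of the project: `flag_abusive` and `fake_score` are hypothetical names, and `fake_score` merely stands in for `predict_text` so the snippet runs without the model files.

```python
def flag_abusive(texts, score_fn, threshold=0.5):
    """Score each text with `score_fn` and flag those at or above `threshold`.

    `score_fn` is any callable that maps a string to a score, such as
    `predict_text` from the example above.
    """
    results = []
    for text in texts:
        score = float(score_fn(text))
        results.append((text, score, score >= threshold))
    return results


def fake_score(text):
    """Stand-in scorer so this sketch runs without the model (hypothetical)."""
    return 0.9 if "bad" in text else 0.1


for text, score, flagged in flag_abusive(["bad word", "hello"], fake_score):
    print(f"{text!r}: score={score:.2f}, flagged={flagged}")
```

Passing the scorer in as an argument keeps the threshold adjustable per dataset, which matters given how much the definition of profanity varies between the datasets in the validation table.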


## Maker

> Tanat
```
github: Tanat05
discord: Tanat05
email: tanat@tanat.kr
```