
	---
	language: en
	tags:
	- text-classification
	- hate-speech
	- twitter
	- knn
	- sklearn
	datasets:
	- hate_speech_offensive
	metrics:
	- f1
	library_name: sklearn
	---

	# Hate Speech Detector — KNN Pipeline

	A KNN classifier for hate speech classification on Twitter.

	## Labels
	- 0 (Hate Speech): hateful language
	- 1 (Offensive): offensive, but not hate speech
	- 2 (Neither): normal language

	## Pipeline
	- TF-IDF (15k features) + Chi2 selection (top 5000)
	- Sentence Embeddings: `all-MiniLM-L6-v2` (384 dimensions)
	- Meta features: word count, uppercase ratio, mention count, etc.
	- KNN (k=3, euclidean, distance-weighted, BallTree)
	- Imbalance: sample_weight='balanced' (no ADASYN, to avoid overfitting)
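
To make two of the steps above concrete, here is a minimal pure-Python sketch of meta-feature extraction and a distance-weighted, Euclidean k-NN vote with k=3. It is an illustration under assumptions, not the code in this repository: the helper names `meta_features` and `knn_predict`, the exact feature definitions, and the toy tweets are hypothetical, and the TF-IDF, Chi2, and sentence-embedding features are left out for brevity.

```python
import math
import re
from collections import defaultdict


def meta_features(text):
    """Hypothetical meta features: word count, uppercase ratio, mention count."""
    words = text.split()
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    mentions = len(re.findall(r"@\w+", text))
    return [float(len(words)), upper_ratio, float(mentions)]


def knn_predict(train_X, train_y, query, k=3):
    """Brute-force k-NN: Euclidean distance, votes weighted by 1 / distance."""
    nearest = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    # If the query coincides with training points, only those points vote.
    exact = [label for d, label in nearest if d == 0.0]
    if exact:
        return max(set(exact), key=exact.count)
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / d
    return max(votes, key=votes.get)


# Toy data for illustration only (not taken from the training set).
train_texts = [
    "@user YOU ARE THE WORST",
    "@a @b SHUT UP NOW",
    "lovely weather in the park today",
    "just had a great cup of coffee this morning",
]
train_y = [1, 1, 2, 2]
train_X = [meta_features(t) for t in train_texts]

pred = knn_predict(train_X, train_y, meta_features("@you STOP IT NOW"))
print(pred)
```

With distance weighting, closer neighbours contribute more to the vote, so a single very close neighbour can outweigh two distant ones. This sketch does a brute-force search, whereas a BallTree index only speeds up the neighbour lookup and does not change the result.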

	## Results
	| Metric | Score |
	|--------|-------|
	| Accuracy | 0.8574 |
	| Macro F1 | 0.6396 |
	| Weighted F1 | 0.8437 |
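
Macro F1 averages the per-class F1 scores equally, while weighted F1 weights each class by its support, so the two diverge when classes are imbalanced. The sketch below shows the arithmetic on made-up labels. The function names and the toy data are assumptions for illustration, and this card does not report per-class scores.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by true-label counts."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Toy labels for illustration only: class 0 is rare and half of it is missed.
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
y_pred = [1, 0, 1, 1, 1, 1, 1, 1, 2, 2]

print(f1_per_class(y_true, y_pred))
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

In this toy case the rare class drags macro F1 below weighted F1, which is the same direction as the gap in the table above.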

	## Load pipeline
	```python
	import joblib
	from huggingface_hub import hf_hub_download

	path = hf_hub_download(repo_id="Merikatori/hate-speech-knn", filename="knn_pipeline.pkl")
	pipeline = joblib.load(path)

	# Predict
	knn = pipeline['knn']
	# (feature extraction must be run first; see gradio_demo.py)
	```