Merikatori/hate-speech-knn

Text Classification

Hate Speech Detector — KNN Pipeline

A KNN classifier for detecting hate speech in Twitter posts.

Labels

0: Hate Speech (hateful language)
1: Offensive (offensive, but not hate speech)
2: Neither (ordinary language)
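
The label ids above can be turned into readable names with a small lookup table. This is a minimal sketch; the names `ID2LABEL`, `LABEL2ID`, and `decode` are hypothetical helpers and are not part of the published pipeline.

```python
# Label ids as listed in this model card.
ID2LABEL = {0: "Hate Speech", 1: "Offensive", 2: "Neither"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def decode(predictions):
    """Map an iterable of integer class ids to their label names."""
    return [ID2LABEL[int(p)] for p in predictions]


print(decode([2, 0, 1]))  # ['Neither', 'Hate Speech', 'Offensive']
```

Casting with `int()` lets the helper accept NumPy integers as well as plain Python ints.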

Pipeline

TF-IDF (15k features) + Chi2 selection (top 5000)
Sentence embeddings: all-MiniLM-L6-v2 (384 dimensions)
Meta features: word count, uppercase ratio, mention count, etc.
KNN (k=3, Euclidean distance, distance-weighted, BallTree)
Class imbalance: sample_weight='balanced' (ADASYN is not used, to avoid overfitting)
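
Two of the steps above can be illustrated without any dependencies: the meta features and the distance-weighted vote. The sketch below is only an illustration under assumptions. The exact feature definitions are guesses (the card only names the features), `meta_features` and `knn_predict` are hypothetical names, and a brute-force search stands in for BallTree, which is an index structure that speeds up the same neighbour lookup.

```python
import math
import re
from collections import defaultdict


def meta_features(text):
    """Return [word count, uppercase ratio, mention count] for one tweet.

    These definitions are assumptions; the model card only names the features.
    """
    words = text.split()
    letters = [ch for ch in text if ch.isalpha()]
    upper_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    mention_count = len(re.findall(r"@\w+", text))
    return [float(len(words)), upper_ratio, float(mention_count)]


def knn_predict(train_X, train_y, query, k=3):
    """Distance-weighted k-NN vote using Euclidean distance (brute force)."""
    neighbours = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))[:k]
    votes = defaultdict(float)
    exact = [y for d, y in neighbours if d == 0.0]
    if exact:
        # Inverse distance is undefined at 0, so exact matches decide the vote.
        for y in exact:
            votes[y] += 1.0
    else:
        for d, y in neighbours:
            votes[y] += 1.0 / d  # closer neighbours count for more
    return max(votes, key=votes.get)


# Toy 2-D points, invented for illustration only.
train_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_y = [2, 1, 1, 0]

print(meta_features("HELLO @user how are you"))
print(knn_predict(train_X, train_y, [0.1, 0.1]))  # 2
```

In the toy query, two of the three nearest neighbours carry label 1, but the single very close neighbour with label 2 outweighs them. An unweighted majority vote would return 1 instead.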

Results

Metric	Score
Accuracy	0.8574
Macro F1	0.6396
Weighted F1	0.8437
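
The gap between Macro F1 and Weighted F1 is typical of imbalanced data: macro averaging gives every class the same weight, while weighted averaging scales each class by its number of true samples. The sketch below shows the two averages on invented toy labels (not this model's predictions); the function names are hypothetical.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Compute the F1 score of each class from paired true/predicted labels."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred):
    """Return (macro F1, support-weighted F1)."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)  # number of true samples per class
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[c] * support[c] for c in scores) / len(y_true)
    return macro, weighted


# Toy data: class 1 is frequent, class 0 is rare and half of it is missed.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 8 + [1, 0]

macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro={macro:.3f} weighted={weighted:.3f}")
```

Here the rare class drags the macro average down while barely moving the weighted one, which is the same pattern as in the table above.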

Load pipeline

import joblib
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="Merikatori/hate-speech-knn", filename="knn_pipeline.pkl")
pipeline = joblib.load(path)

# Predict
knn = pipeline['knn']  # the fitted KNN classifier stored in the pipeline
# (feature extraction must be run on the input first; see gradio_demo.py)
