Hate Speech Detector — KNN Pipeline
KNN classifier for hate speech classification on Twitter.
Labels
- 0: Hate Speech (hateful language)
- 1: Offensive (offensive, but not hate speech)
- 2: Neither (neutral)
Pipeline
- TF-IDF (15k features) + Chi2 selection (top 5000)
- Sentence Embeddings: all-MiniLM-L6-v2 (384 dimensions)
- Meta features: word count, uppercase ratio, mention count, etc.
- KNN (k=3, euclidean, distance-weighted, BallTree)
- Imbalance: sample_weight='balanced' (no ADASYN, to avoid overfitting)
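To illustrate what "k=3, euclidean, distance-weighted" means in the list above, here is a minimal pure-Python sketch of an inverse-distance-weighted vote. It is not the code used to train this model: the function name `knn_predict` and the toy data are hypothetical, and it uses a brute-force neighbour search instead of a BallTree (a BallTree only speeds up the search; it returns the same neighbours).

```python
import math
from collections import Counter, defaultdict

def knn_predict(train_X, train_y, query, k=3):
    """Distance-weighted k-NN vote with Euclidean distance (brute force)."""
    # Distance from the query to every training point.
    dists = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Keep the k closest neighbours.
    nearest = sorted(dists, key=lambda pair: pair[0])[:k]
    # Neighbours at distance 0 would get infinite weight, so let them decide.
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return Counter(exact).most_common(1)[0][0]
    # Otherwise each neighbour votes with weight 1 / distance.
    votes = defaultdict(float)
    for d, y in nearest:
        votes[y] += 1.0 / d
    return max(votes, key=votes.get)

# Toy 1-D data (made up): one close neighbour of class 1, two far ones of class 2.
train_X = [(0.0,), (3.0,), (3.5,)]
train_y = [1, 2, 2]
pred = knn_predict(train_X, train_y, (0.5,))
```

In this toy case a plain majority vote would pick class 2, while the distance-weighted vote picks class 1 because the single close neighbour outweighs the two distant ones.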
Results
| Metric | Score |
|---|---|
| Accuracy | 0.8574 |
| Macro F1 | 0.6396 |
| Weighted F1 | 0.8437 |
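The gap between Macro F1 and Weighted F1 in the table is typical of imbalanced labels: macro averaging gives every class equal weight, while weighted averaging scales each class by its number of true samples. The sketch below shows the two averages on made-up toy labels (not this model's predictions); the helper names are hypothetical.

```python
def f1_per_class(y_true, y_pred, labels):
    """Per-class F1 computed from true/false positives and false negatives."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, labels):
    # Unweighted mean: a rare class counts as much as a common one.
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)

def weighted_f1(y_true, y_pred, labels):
    # Mean weighted by support (number of true samples per class).
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores[c] * y_true.count(c) for c in labels) / len(y_true)

# Toy labels (made up): the rare class 0 is always misclassified as class 1.
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2]
macro = macro_f1(y_true, y_pred, [0, 1, 2])
weighted = weighted_f1(y_true, y_pred, [0, 1, 2])
```

Here the missed minority class pulls the macro average well below the weighted one, which is the same pattern the table shows.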
Load pipeline
import joblib
from huggingface_hub import hf_hub_download
path = hf_hub_download(repo_id="Merikatori/hate-speech-knn", filename="knn_pipeline.pkl")
pipeline = joblib.load(path)
# Predict
knn = pipeline['knn']
# (feature extraction must run first; see gradio_demo.py)
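The actual feature extraction lives in gradio_demo.py and is not reproduced here. As a rough illustration of only the meta features named in the Pipeline section (word count, uppercase ratio, mention count), a sketch might look like the following. The function name, the exact definitions, and the feature order are assumptions and may differ from what the saved pipeline expects.

```python
import re

def meta_features(text):
    """Return [word_count, uppercase_ratio, mention_count] for one tweet.

    Hypothetical definitions, for illustration only.
    """
    # Word count: whitespace-separated tokens.
    words = text.split()
    # Uppercase ratio: share of alphabetic characters that are uppercase.
    letters = [ch for ch in text if ch.isalpha()]
    upper = sum(1 for ch in letters if ch.isupper())
    uppercase_ratio = upper / len(letters) if letters else 0.0
    # Mention count: tokens of the form @username.
    mention_count = len(re.findall(r"@\w+", text))
    return [float(len(words)), uppercase_ratio, float(mention_count)]

features = meta_features("@user STOP it")
```

To get real predictions, use the extraction code in gradio_demo.py so that the feature layout matches the one the classifier was trained on.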