---
language: en
tags:
- text-classification
- hate-speech
- twitter
- knn
- sklearn
datasets:
- hate_speech_offensive
metrics:
- f1
library_name: sklearn
---
# Hate Speech Detector: KNN Pipeline
KNN classifier for hate speech classification on Twitter.
## Labels
- **0 (Hate Speech)**: hateful language
- **1 (Offensive)**: offensive, but not hate speech
- **2 (Neither)**: ordinary language, neither hateful nor offensive
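For downstream code, the label ids listed above can be kept in a small lookup table. This is a minimal sketch; the names `ID2LABEL` and `decode` are illustrative and are not part of the released pipeline.

```python
# Label ids as listed in this model card; the constant and helper
# names are illustrative assumptions, not part of the pickled pipeline.
ID2LABEL = {0: "Hate Speech", 1: "Offensive", 2: "Neither"}


def decode(pred_ids):
    """Map an iterable of integer class ids to their label names."""
    return [ID2LABEL[int(i)] for i in pred_ids]


print(decode([2, 0, 1]))
```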
## Pipeline
- TF-IDF (15k features) + Chi2 selection (top 5000)
- Sentence embeddings: `all-MiniLM-L6-v2` (384 dimensions)
- Meta features: word count, uppercase ratio, mention count, etc.
- KNN (k=3, Euclidean, distance-weighted, BallTree)
- Class imbalance: `sample_weight='balanced'` (no ADASYN, to avoid overfitting)
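As a rough illustration of two of the steps listed above, the sketch below computes three of the named meta features and runs a distance-weighted 3-NN vote with Euclidean distance. The exact feature definitions, the function names `meta_features` and `knn_predict`, and the brute-force neighbour search (used here in place of a BallTree) are assumptions made for illustration; they are not the code shipped in this repository.

```python
import math
from collections import defaultdict


def meta_features(text):
    """Return [word count, uppercase ratio, mention count] for one tweet.

    The definitions are assumptions: the ratio is taken over alphabetic
    characters, and a mention is any whitespace-separated token
    that starts with '@' and has at least one more character.
    """
    words = text.split()
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    mentions = sum(1 for w in words if w.startswith("@") and len(w) > 1)
    return [float(len(words)), upper_ratio, float(mentions)]


def knn_predict(train_X, train_y, x, k=3):
    """Distance-weighted k-NN vote using Euclidean distance (brute force)."""
    neighbours = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )[:k]
    votes = defaultdict(float)
    if any(d == 0.0 for d, _ in neighbours):
        # Exact matches decide the vote on their own (avoids dividing by zero).
        for d, label in neighbours:
            if d == 0.0:
                votes[label] += 1.0
    else:
        # Each neighbour votes with weight 1 / distance.
        for d, label in neighbours:
            votes[label] += 1.0 / d
    return max(votes, key=votes.get)


print(meta_features("@user YOU are BAD"))

# One close neighbour of class 0 outvotes two distant neighbours of class 1.
train_X = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
train_y = [0, 1, 1]
print(knn_predict(train_X, train_y, [0.5, 0.5]))
```

With uniform weights the same query would be assigned class 1 by majority; weighting by inverse distance lets the single nearby point win.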
## Results
| Metric | Score |
|--------|-------|
| Accuracy | 0.8574 |
| Macro F1 | 0.6396 |
| Weighted F1 | 0.8437 |
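Macro F1 averages the per-class F1 scores equally, while Weighted F1 weights each class by its number of true samples, so a Macro F1 well below the Weighted F1 suggests weaker performance on the smaller classes. The sketch below shows how the two averages are computed; the toy labels are made up for illustration and are not the evaluation data behind the table, and `f1_scores` is a hypothetical helper name.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (per_class_f1, macro_f1, weighted_f1) for integer labels."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per class
    per_class = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return per_class, macro, weighted


# Toy, imbalanced example: class 1 dominates and is predicted best.
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
y_pred = [1, 0, 1, 1, 1, 1, 1, 1, 2, 1]
per_class, macro, weighted = f1_scores(y_true, y_pred)
print(per_class, round(macro, 4), round(weighted, 4))
```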
## Load pipeline
```python
import joblib
from huggingface_hub import hf_hub_download

# Download the pickled pipeline from the Hub and load it
path = hf_hub_download(repo_id="Merikatori/hate-speech-knn", filename="knn_pipeline.pkl")
pipeline = joblib.load(path)

# Predict
knn = pipeline['knn']
# (feature extraction must be run first; see gradio_demo.py)
```