---
language: en
tags:
- text-classification
- hate-speech
- twitter
- knn
- sklearn
datasets:
- hate_speech_offensive
metrics:
- f1
library_name: sklearn
---

# Hate Speech Detector — KNN Pipeline

A KNN classifier for hate speech classification on Twitter.

## Labels
- **0 (Hate Speech)**: hateful language
- **1 (Offensive)**: offensive, but not hate speech
- **2 (Neither)**: neutral text

## Pipeline
- TF-IDF (15k features) + Chi2 selection (top 5000)
- Sentence embeddings: `all-MiniLM-L6-v2` (384 dimensions)
- Meta features: word count, uppercase ratio, mention count, etc.
- KNN (k=3, Euclidean distance, distance-weighted, BallTree)
- Class imbalance: sample_weight='balanced' (ADASYN is not used, to avoid overfitting)
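
The steps above can be sketched roughly as follows. This is a minimal illustration, not the training script behind this model: the toy texts, the `meta_features` helper and its exact feature set are hypothetical, the sentence-embedding block is omitted so the snippet runs offline (in the full pipeline the 384-dimensional vectors would be concatenated the same way), and imbalance handling is left out because the card does not say how it is applied.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier


def meta_features(texts):
    """Hypothetical meta features: word count, uppercase ratio, mention count."""
    rows = []
    for t in texts:
        letters = [c for c in t if c.isalpha()]
        upper_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)
        rows.append([len(t.split()), upper_ratio, t.count("@")])
    return np.asarray(rows, dtype=float)


# Toy data for illustration only (labels follow the 0/1/2 scheme above)
texts = [
    "I hate that whole group of people",
    "people like them should not exist here",
    "you are such an IDIOT @user1",
    "shut up @user2 you clown",
    "lovely weather in the park today",
    "just finished reading a great book",
]
labels = [0, 0, 1, 1, 2, 2]

# TF-IDF capped at 15k features, then chi2 selection of at most 5000 terms
tfidf = TfidfVectorizer(max_features=15000)
X_tfidf = tfidf.fit_transform(texts)
k = min(5000, X_tfidf.shape[1])  # the toy vocabulary is far smaller than 5000
selector = SelectKBest(chi2, k=k)
X_sel = selector.fit_transform(X_tfidf, labels)

# Concatenate the selected TF-IDF columns with the meta features.
# Dense arrays are used because BallTree does not accept sparse input.
X = np.hstack([X_sel.toarray(), meta_features(texts)])

knn = KNeighborsClassifier(
    n_neighbors=3,
    metric="euclidean",
    weights="distance",
    algorithm="ball_tree",
)
knn.fit(X, labels)
pred = knn.predict(X)
```

Because the distance is Euclidean, unscaled columns such as word count can dominate the TF-IDF values; whether and how the original pipeline scales its features is not stated here.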

## Results
| Metric | Score |
|--------|-------|
| Accuracy | 0.8574 |
| Macro F1 | 0.6396 |
| Weighted F1 | 0.8437 |
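
The gap between Macro F1 and Weighted F1 is typical of imbalanced data: macro averaging gives every class the same weight, while weighted averaging follows class frequency. The snippet below shows how the three metrics can be computed with scikit-learn; the label arrays are made-up toy values, not the evaluation data behind the table.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration only: class 1 dominates, class 0 is rare
y_true = [0, 1, 1, 1, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 1, 2, 2, 1]

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores a class that is never predicted as 0 without a warning
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy={acc:.4f} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

In this toy case the rare class is never predicted, which pulls the macro average down much more than the weighted one.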

## Load pipeline
```python
import joblib
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="Merikatori/hate-speech-knn", filename="knn_pipeline.pkl")
pipeline = joblib.load(path)

# Fetch the fitted KNN classifier from the loaded pipeline
knn = pipeline['knn']
# Feature extraction must be run before predicting (see gradio_demo.py)
```