---
language: en
tags:
- text-classification
- hate-speech
- twitter
- knn
- sklearn
datasets:
- hate_speech_offensive
metrics:
- f1
library_name: sklearn
---
# Hate Speech Detector — KNN Pipeline
A KNN classifier for hate speech classification on Twitter.
## Labels
- **0 (Hate Speech)**: hateful language
- **1 (Offensive)**: offensive, but not hate speech
- **2 (Neither)**: normal content
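
The label encoding above can be kept in a small lookup table so that numeric predictions are readable. This is a minimal sketch; the names `ID2LABEL` and `decode_labels` are hypothetical and are not part of the published pipeline.

```python
# Hypothetical helper: maps the class ids listed above to their names.
ID2LABEL = {0: "Hate Speech", 1: "Offensive", 2: "Neither"}


def decode_labels(class_ids):
    """Convert an iterable of class ids (ints or NumPy ints) to label names."""
    return [ID2LABEL[int(i)] for i in class_ids]


print(decode_labels([2, 0, 1]))
```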
## Pipeline
- TF-IDF (15k features) + Chi2 selection (top 5000)
- Sentence embeddings: `all-MiniLM-L6-v2` (384 dimensions)
- Meta features: word count, uppercase ratio, mention count, etc.
- KNN (k=3, euclidean, distance-weighted, BallTree)
- Class imbalance: `sample_weight='balanced'` (no ADASYN, to avoid overfitting)
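
To make the meta features and the distance-weighted vote more concrete, here is a from-scratch sketch in plain Python. It is an illustration under assumptions, not the code used to train this model: the exact feature definitions are guesses based on the names listed above, the tweets are made up, and the TF-IDF, Chi2, and embedding features are left out so the example runs without any downloads.

```python
import math
import re
from collections import defaultdict


def meta_features(text):
    """Assumed definitions: word count, uppercase-letter ratio, @mention count."""
    words = text.split()
    letters = [ch for ch in text if ch.isalpha()]
    upper_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    mentions = len(re.findall(r"@\w+", text))
    return [float(len(words)), upper_ratio, float(mentions)]


def knn_predict(train_X, train_y, query, k=3):
    """Distance-weighted k-NN vote using Euclidean distance."""
    neighbors = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))[:k]
    votes = defaultdict(float)
    exact = [y for d, y in neighbors if d == 0.0]
    if exact:
        # Exact matches decide the vote on their own (avoids dividing by zero).
        for y in exact:
            votes[y] += 1.0
    else:
        # Closer neighbors get a larger say: weight = 1 / distance.
        for d, y in neighbors:
            votes[y] += 1.0 / d
    return max(votes, key=votes.get)


# Made-up toy tweets, labeled 1 (Offensive) or 2 (Neither).
train_texts = [
    "@user YOU ARE THE WORST",
    "@a @b SHUT UP NOW",
    "what a lovely morning for a walk in the park",
    "just finished reading a really good book today",
]
train_y = [1, 1, 2, 2]
train_X = [meta_features(t) for t in train_texts]

pred = knn_predict(train_X, train_y, meta_features("@someone STOP TALKING"))
print(pred)
```

In scikit-learn, the same voting rule corresponds to `KNeighborsClassifier(n_neighbors=3, metric="euclidean", weights="distance", algorithm="ball_tree")`, which matches the settings listed above.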
## Results
| Metric | Score |
|--------|-------|
| Accuracy | 0.8574 |
| Macro F1 | 0.6396 |
| Weighted F1 | 0.8437 |
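
Macro F1 averages the per-class F1 scores equally, while weighted F1 weights each class by its number of true samples, so a gap between the two usually means a minority class is scoring lower than the majority class. The sketch below shows how both averages are computed on made-up labels; the toy data is purely illustrative and is unrelated to the scores in the table.

```python
from collections import Counter


def f1_per_class(y_true, y_pred, labels):
    """Per-class F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred, labels=(0, 1, 2)):
    per_class = f1_per_class(y_true, y_pred, labels)
    support = Counter(y_true)  # number of true samples per class
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return macro, weighted


# Made-up, imbalanced toy labels: class 1 dominates, class 0 is rare.
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
y_pred = [1, 0, 1, 1, 1, 1, 1, 1, 2, 2]

macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

These are the same definitions used by scikit-learn's `f1_score` with `average="macro"` and `average="weighted"`.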
## Load pipeline
```python
import joblib
from huggingface_hub import hf_hub_download
path = hf_hub_download(repo_id="Merikatori/hate-speech-knn", filename="knn_pipeline.pkl")
pipeline = joblib.load(path)
# Predict
knn = pipeline['knn']
# (feature extraction must be run first; see gradio_demo.py)
```