---
language: en
tags:
- text-classification
- hate-speech
- twitter
- knn
- sklearn
datasets:
- hate_speech_offensive
metrics:
- f1
library_name: sklearn
---

# Hate Speech Detector — KNN Pipeline

A KNN classifier for hate speech classification on Twitter.

## Labels
- **0 (Hate Speech)**: hateful language
- **1 (Offensive)**: offensive but not hate speech
- **2 (Neither)**: neither hateful nor offensive

## Pipeline
- TF-IDF (15k features) + Chi2 selection (top 5000)
- Sentence embeddings: `all-MiniLM-L6-v2` (384 dimensions)
- Meta features: word count, uppercase ratio, mention count, etc.
- KNN (k=3, euclidean, distance-weighted, BallTree)
- Class imbalance: sample_weight='balanced' (no ADASYN, to avoid overfitting)
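
As an illustration of the meta features listed above, here is a minimal sketch that computes three of them for a single tweet. The function name `meta_features`, the mention regex, and the definition of uppercase ratio (uppercase letters divided by all letters) are assumptions made for this example; the actual extraction code lives in `gradio_demo.py` and may differ.

```python
import re
from typing import List

# Assumed pattern for Twitter mentions: "@" followed by word characters.
MENTION_RE = re.compile(r"@\w+")


def meta_features(text: str) -> List[float]:
    """Return [word_count, uppercase_ratio, mention_count] for one tweet.

    Hypothetical helper: the real pipeline may define these differently.
    """
    words = text.split()
    letters = [ch for ch in text if ch.isalpha()]
    upper = sum(1 for ch in letters if ch.isupper())
    # Guard against empty or letter-free input to avoid division by zero.
    uppercase_ratio = upper / len(letters) if letters else 0.0
    mention_count = len(MENTION_RE.findall(text))
    return [float(len(words)), uppercase_ratio, float(mention_count)]


example = meta_features("@user STOP it now")
```

In a pipeline like the one described, a vector of this kind would be concatenated with the TF-IDF and embedding features before being passed to the classifier.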

## Results
| Metric | Score |
|--------|-------|
| Accuracy | 0.8574 |
| Macro F1 | 0.6396 |
| Weighted F1 | 0.8437 |
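
Macro F1 averages the per-class F1 scores equally, while weighted F1 weights each class by its number of true samples, so a gap between the two typically points to weaker performance on a smaller class. The sketch below shows the two averages in plain Python on made-up toy labels (not the model's predictions); the function names are chosen for this example only.

```python
from collections import Counter


def f1_per_class(y_true, y_pred, labels):
    """Compute the F1 score of each class from true and predicted labels."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Return (macro F1, weighted F1) for the given labels."""
    per_class = f1_per_class(y_true, y_pred, labels)
    support = Counter(y_true)  # number of true samples per class
    macro = sum(per_class[c] for c in labels) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return macro, weighted


# Toy data (invented): class 1 dominates, and the single class-0 sample is missed.
y_true = [1] * 8 + [0] + [2]
y_pred = [1] * 8 + [1] + [2]
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
```

On this toy data the missed minority class pulls macro F1 well below weighted F1, the same direction as the gap in the table above.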

## Load pipeline
```python
import joblib
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="Merikatori/hate-speech-knn", filename="knn_pipeline.pkl")
pipeline = joblib.load(path)

# Get the fitted KNN classifier from the loaded pipeline dict
knn = pipeline['knn']
# Feature extraction must run before calling knn.predict; see gradio_demo.py
```