---
language: en
tags:
- text-classification
- hate-speech
- knn
- sklearn
datasets:
- hate_speech_offensive
metrics:
- f1
library_name: sklearn
---
# Hate Speech Detector: KNN Pipeline

A KNN classifier for hate speech classification on Twitter.

## Labels

- **0 (Hate Speech)**: hateful language
- **1 (Offensive)**: offensive, but not hate speech
- **2 (Neither)**: neither hateful nor offensive
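
The label scheme above can be turned into a small lookup for decoding model outputs. This is a minimal sketch: the names `ID2LABEL` and `decode` are hypothetical helpers introduced here for illustration and are not part of the published pipeline.

```python
# Hypothetical helper: maps the integer class ids listed above to readable names.
ID2LABEL = {0: "Hate Speech", 1: "Offensive", 2: "Neither"}


def decode(predictions):
    """Convert an iterable of integer class ids into label names."""
    return [ID2LABEL[int(p)] for p in predictions]


decoded = decode([0, 2, 1])
print(decoded)
```
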

## Pipeline

- TF-IDF (15k features) + Chi2 selection (top 5000)
- Sentence embeddings: `all-MiniLM-L6-v2` (384 dimensions)
- Meta features: word count, uppercase ratio, mention count, etc.
- KNN (k=3, euclidean, distance-weighted, BallTree)
- Imbalance: `sample_weight='balanced'` (no ADASYN, to avoid overfitting)
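
The TF-IDF, Chi2, and KNN steps listed above could be wired together roughly as follows. This is a sketch under stated assumptions, not the training code behind this model: the toy texts are invented, the sentence embeddings and meta features are left out, and the Chi2 `k` is capped so the example runs on a tiny vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier

# Toy data invented for illustration; labels follow the card (0, 1, 2).
texts = [
    "i hate those people so much",
    "those people are the worst, i hate them",
    "you are such an idiot",
    "shut up you idiot",
    "what a lovely sunny day",
    "lovely day for a walk in the park",
]
labels = np.array([0, 0, 1, 1, 2, 2])

# TF-IDF capped at 15k features, as described in the card.
tfidf = TfidfVectorizer(max_features=15000)
X_tfidf = tfidf.fit_transform(texts)

# Chi2 selection: the card keeps the top 5000; capped here for the toy vocabulary.
k_best = min(5000, X_tfidf.shape[1])
selector = SelectKBest(chi2, k=k_best)
X_selected = selector.fit_transform(X_tfidf, labels)

# Tree-based neighbor search expects dense input, so densify (fine at toy scale).
X_dense = X_selected.toarray()

# KNN with the hyperparameters named in the card.
knn = KNeighborsClassifier(
    n_neighbors=3,
    metric="euclidean",
    weights="distance",
    algorithm="ball_tree",
)
knn.fit(X_dense, labels)

# Apply the same fitted transforms to a new text before predicting.
query = selector.transform(tfidf.transform(["i hate them"])).toarray()
pred = knn.predict(query)
print(pred)
```

With `weights="distance"`, closer neighbors count more in the vote than distant ones, so the three neighbors do not contribute equally.
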

## Results

| Metric | Score |
|--------|-------|
| Accuracy | 0.8574 |
| Macro F1 | 0.6396 |
| Weighted F1 | 0.8437 |
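
Macro F1 averages the per-class F1 scores equally, while weighted F1 weights each class by its number of true samples, so a macro score well below the weighted one is consistent with weaker performance on a minority class. The sketch below shows the difference on invented labels; these toy arrays are an assumption for illustration and are unrelated to the scores in the table.

```python
from sklearn.metrics import f1_score

# Invented, imbalanced toy labels: class 1 dominates, class 0 is rare.
y_true = [1, 1, 1, 1, 1, 1, 2, 2, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 2, 2, 0, 1]  # one class-0 sample is missed

# Macro: unweighted mean of per-class F1.
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Weighted: per-class F1 weighted by class support.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"macro={macro_f1:.4f} weighted={weighted_f1:.4f}")
```
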

## Load pipeline

```python
import joblib
from huggingface_hub import hf_hub_download

# Download the pickled pipeline from the Hub and load it.
path = hf_hub_download(repo_id="Merikatori/hate-speech-knn", filename="knn_pipeline.pkl")
pipeline = joblib.load(path)

# Predict: the fitted KNN classifier is stored under the 'knn' key.
knn = pipeline['knn']
# (feature extraction must be run first; see gradio_demo.py)
```
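
As a rough idea of what part of the feature extraction step involves, the meta features named in the Pipeline section (word count, uppercase ratio, mention count) could be computed as below. This is a hedged sketch: `meta_features` is a hypothetical function, and the exact definitions used in `gradio_demo.py` may differ.

```python
import re


def meta_features(text: str) -> dict:
    """Hypothetical meta features; definitions here are assumptions."""
    words = text.split()
    letters = [c for c in text if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return {
        # Number of whitespace-separated tokens.
        "word_count": len(words),
        # Share of alphabetic characters that are uppercase (0.0 if none).
        "uppercase_ratio": upper / len(letters) if letters else 0.0,
        # Number of Twitter-style @mentions.
        "mention_count": len(re.findall(r"@\w+", text)),
    }


feats = meta_features("@user1 @user2 STOP it now")
print(feats)
```

Features like these would need to be computed the same way at training and inference time before being passed to the classifier.
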