---
language: en
tags:
- text-classification
- hate-speech
- twitter
- knn
- sklearn
datasets:
- hate_speech_offensive
metrics:
- f1
library_name: sklearn
---

# Hate Speech Detector: KNN Pipeline

A KNN classifier for hate speech classification on Twitter.

## Labels

- **0: Hate Speech**: hateful language
- **1: Offensive**: offensive, but not hate speech
- **2: Neither**: neutral content

## Pipeline

- TF-IDF (15k features) + Chi2 selection (top 5000)
- Sentence embeddings: `all-MiniLM-L6-v2` (384 dimensions)
- Meta features: word count, uppercase ratio, mention count, etc.
- KNN (k=3, euclidean, distance-weighted, BallTree)
- Imbalance handling: `sample_weight='balanced'` (ADASYN is not used, to avoid overfitting)

## Results

| Metric | Score |
|--------|-------|
| Accuracy | 0.8574 |
| Macro F1 | 0.6396 |
| Weighted F1 | 0.8437 |

## Load pipeline

```python
import joblib
from huggingface_hub import hf_hub_download

# Download the pickled pipeline from the Hub and load it
path = hf_hub_download(repo_id="Merikatori/hate-speech-knn", filename="knn_pipeline.pkl")
pipeline = joblib.load(path)

# Predict
knn = pipeline['knn']
# (feature extraction must run first; see gradio_demo.py)
```
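
To illustrate two of the pipeline steps listed above, here is a minimal standard-library sketch of meta-feature extraction and a distance-weighted KNN vote with k=3 and Euclidean distance. It is not the code in `gradio_demo.py`: the helper names `meta_features` and `knn_predict`, the exact definition of each meta feature, and the toy data are all assumptions made for illustration. It also skips the TF-IDF, Chi2, and embedding features.

```python
import math
import re
from collections import defaultdict

# Assumed definition of a mention: "@" followed by word characters.
MENTION_RE = re.compile(r"@\w+")


def meta_features(text):
    """Return [word_count, uppercase_ratio, mention_count] for one tweet.

    Hypothetical definitions: words are whitespace-separated tokens and the
    uppercase ratio is computed over alphabetic characters only.
    """
    words = text.split()
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    mentions = len(MENTION_RE.findall(text))
    return [float(len(words)), upper_ratio, float(mentions)]


def knn_predict(train_X, train_y, x, k=3):
    """Distance-weighted k-NN vote using Euclidean distance (weight = 1 / d)."""
    # Sort training points by distance to the query and keep the k nearest.
    nearest = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )[:k]
    votes = defaultdict(float)
    for dist, label in nearest:
        if dist == 0.0:
            # Simplification: an exact match decides the label on its own.
            return label
        votes[label] += 1.0 / dist
    return max(votes, key=votes.get)


# Toy 2-D feature vectors (not real tweets) with labels 2 = Neither, 1 = Offensive.
toy_X = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
toy_y = [2, 2, 1, 1]

features = meta_features("HELLO @user how are you")
prediction = knn_predict(toy_X, toy_y, [0.5, 0.2], k=3)
print(features, prediction)
```

Weighting each neighbour by the inverse of its distance means the two close points labelled 2 outvote the single distant point labelled 1, even though all three fall inside the k=3 neighbourhood.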