metadata
language: en
tags:
  - text-classification
  - hate-speech
  - twitter
  - knn
  - sklearn
datasets:
  - hate_speech_offensive
metrics:
  - f1
library_name: sklearn

Hate Speech Detector — KNN Pipeline

A KNN classifier for hate speech classification on Twitter.

Labels

  • 0 (Hate Speech): hateful language
  • 1 (Offensive): offensive, but not hate speech
  • 2 (Neither): ordinary, neither hateful nor offensive
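
To turn the integer predictions into readable names, a small lookup is enough. This is a minimal sketch; the names `ID2LABEL` and `decode` are hypothetical helpers and are not part of the published pipeline.

```python
# Label ids as listed above; ID2LABEL and decode are illustrative names only.
ID2LABEL = {0: "Hate Speech", 1: "Offensive", 2: "Neither"}


def decode(preds):
    """Map an iterable of integer class ids to their label names."""
    return [ID2LABEL[int(p)] for p in preds]


print(decode([1, 2, 0]))
```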

Pipeline

  • TF-IDF (15k features) + Chi2 selection (top 5000)
  • Sentence embeddings: all-MiniLM-L6-v2 (384 dimensions)
  • Meta features: word count, uppercase ratio, mention count, etc.
  • KNN (k=3, euclidean, distance-weighted, BallTree)
  • Imbalance: sample_weight='balanced' (no ADASYN, to avoid overfitting)
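
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the training code behind the published model: the toy corpus is made up, `meta_features` and `featurize` are hypothetical helpers, the sentence-embedding block is left out so the sketch runs offline, and class balancing is not shown. The Chi2 `k` is capped at the vocabulary size because the toy corpus has far fewer than 5000 terms.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier


def meta_features(text):
    """Hypothetical meta features: word count, uppercase ratio, mention count."""
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    mentions = sum(tok.startswith("@") for tok in text.split())
    return [float(len(text.split())), upper_ratio, float(mentions)]


# Made-up toy corpus; labels follow the card (0 hate, 1 offensive, 2 neither).
texts = [
    "i hate GROUPX they are vermin",
    "GROUPX should all disappear",
    "@bob you are such an idiot",
    "shut up @amy you clown",
    "what a lovely sunny day",
    "just had coffee with @sam",
]
labels = np.array([0, 0, 1, 1, 2, 2])

# TF-IDF capped at 15k terms, then Chi2 selection of the top 5000 (or fewer).
tfidf = TfidfVectorizer(max_features=15000)
X_tfidf = tfidf.fit_transform(texts)
k = min(5000, X_tfidf.shape[1])
selector = SelectKBest(chi2, k=k)
X_sel = selector.fit_transform(X_tfidf, labels)

# Concatenate selected TF-IDF columns with the meta features (dense for BallTree).
X_meta = np.array([meta_features(t) for t in texts])
X = np.hstack([X_sel.toarray(), X_meta])

knn = KNeighborsClassifier(
    n_neighbors=3, metric="euclidean", weights="distance", algorithm="ball_tree"
)
knn.fit(X, labels)


def featurize(new_texts):
    """Apply the fitted TF-IDF, Chi2 selector, and meta features to new texts."""
    sel = selector.transform(tfidf.transform(new_texts)).toarray()
    meta = np.array([meta_features(t) for t in new_texts])
    return np.hstack([sel, meta])


pred = knn.predict(featurize(["@bob you are a total idiot"]))
print(pred)
```

In a full version the 384-dimensional embeddings would be stacked next to the other two blocks in the same `np.hstack` call. The array is converted to dense because scikit-learn falls back to brute-force search when fitted on sparse input.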

Results

Metric        Score
Accuracy      0.8574
Macro F1      0.6396
Weighted F1   0.8437
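
The gap between macro and weighted F1 is typical when classes are imbalanced: macro F1 averages the per-class scores equally, while weighted F1 weights them by class frequency. The sketch below shows how the three metrics can be computed with scikit-learn; the label arrays are invented for illustration and are not the model's evaluation data.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: class 1 dominates, classes 0 and 2 are rare.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 2, 1]

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")        # every class counts equally
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # weighted by class support

print(f"Accuracy    {acc:.4f}")
print(f"Macro F1    {macro_f1:.4f}")
print(f"Weighted F1 {weighted_f1:.4f}")
```

Errors on the rare classes pull macro F1 down more than weighted F1, which is the same pattern as in the scores reported above.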

Load pipeline

import joblib
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="Merikatori/hate-speech-knn", filename="knn_pipeline.pkl")
pipeline = joblib.load(path)

# Predict: the loaded object is a dict; the fitted KNN is stored under 'knn'
knn = pipeline['knn']
# (feature extraction must be run first; see gradio_demo.py)