roberta-hate-speech-detector
Twitter-native RoBERTa fine-tuned on 110K+ examples from four public hate speech datasets, augmented with targeted neo-Nazi codes and antisemitic dog whistles that general Twitter corpora miss.
Three output classes:
| ID | Label | Meaning |
|---|---|---|
| 0 | neither | Clean: not hateful or offensive |
| 1 | offensive | Crude, profane, or offensive, but not hate speech |
| 2 | hate_speech | Hate speech: slurs, coded language, dog whistles, white nationalist symbols |
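For code that works with raw logits instead of the pipeline, the table above can be mirrored as a small lookup. This is a minimal sketch: `ID2LABEL`, `LABEL2ID`, and `decode` are names chosen for illustration, and it assumes the checkpoint's own `config.id2label` matches the table, which is worth verifying before relying on it.

```python
# Label scheme copied from the table above (assumed to match config.id2label).
ID2LABEL = {0: "neither", 1: "offensive", 2: "hate_speech"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode(logits):
    """Return the label with the largest logit (ties go to the lowest ID)."""
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")
    best_id = max(range(len(logits)), key=lambda i: logits[i])
    return ID2LABEL[best_id]
```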
Quick Start
```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline(
    "text-classification",
    model="AuricErgeson/hate-speech-detector",
)

classifier("they control the media")
# [{'label': 'hate_speech', 'score': 0.87}]

classifier("this movie sucked ass")
# [{'label': 'offensive', 'score': 0.91}]

classifier("I really enjoyed the concert")
# [{'label': 'neither', 'score': 0.97}]
```
Batch inference:
```python
# Passing a list returns one prediction per input, in the same order
texts = ["heil hitler", "what a stupid film", "the weather is nice today"]
results = classifier(texts)
```
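For input lists too large to pass in one call, one option is to feed the classifier fixed-size chunks and keep each text next to its prediction. The sketch below is an illustration, not part of the released model: `classify_in_chunks` and `chunk_size` are hypothetical names, and it assumes `classifier` returns one `{'label': ..., 'score': ...}` dict per input, as in the examples above.

```python
def classify_in_chunks(classifier, texts, chunk_size=64):
    """Run `classifier` over `texts` in chunks and pair each text with its prediction."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    rows = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        predictions = classifier(chunk)
        if len(predictions) != len(chunk):
            raise RuntimeError("classifier returned a different number of results")
        for text, pred in zip(chunk, predictions):
            rows.append({"text": text, "label": pred["label"], "score": pred["score"]})
    return rows
```

Because the function only needs a callable, it can be exercised with a stub before the model is downloaded.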
Performance
Evaluated on a held-out test set of 11,059 examples (stratified 10% split).
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| neither | 0.955 | 0.822 | 0.884 | 6,514 |
| offensive | 0.830 | 0.913 | 0.870 | 2,703 |
| hate_speech | 0.607 | 0.817 | 0.697 | 1,842 |
| weighted avg | 0.867 | 0.843 | 0.849 | 11,059 |
- Val weighted F1: 0.852 (best checkpoint, epoch 3)
- Test accuracy: 0.843
- Passed 7/8 probe cases (see Limitations)
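The per-class and weighted figures in the table can be cross-checked from precision, recall, and support alone. The sketch below recomputes them from the rounded table values, so agreement is only expected to within rounding (about 0.001). The names `REPORT`, `f1`, and `weighted_average` are illustrative.

```python
# (precision, recall, support) per class, copied from the table above.
REPORT = {
    "neither": (0.955, 0.822, 6514),
    "offensive": (0.830, 0.913, 2703),
    "hate_speech": (0.607, 0.817, 1842),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_average(values, supports):
    """Average of `values`, weighted by the number of examples in each class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


supports = [s for _, _, s in REPORT.values()]
per_class_f1 = {name: f1(p, r) for name, (p, r, _) in REPORT.items()}
weighted_f1 = weighted_average(list(per_class_f1.values()), supports)
weighted_recall = weighted_average([r for _, r, _ in REPORT.values()], supports)
```

For single-label classification, support-weighted recall is the same quantity as overall accuracy, which is why 0.843 appears both in the weighted avg row and as the test accuracy.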
Training Data
110,585 examples fused from four public datasets plus targeted augmentation:
| Dataset | Examples | What it covers |
|---|---|---|
| Davidson et al. 2017 | 24,783 | Explicit Twitter slurs and offensive language |
| ImplicitHate | 21,480 | Coded language and dog whistles |
| HateXplain | 19,229 | Multi-annotator hate speech with rationales |
| HateDay 2025 (English only) | 45,000 | Large-scale contemporary Twitter hate |
| Targeted augmentation | 93 | Neo-Nazi codes (1488, heil hitler, 14 words), antisemitic dog whistles (ZOG, "they control the media"), white nationalist phrases |
Label harmonization — all datasets mapped to the unified 3-class scheme. Training set oversampled to 21,621 examples per class (64,863 total) to counter the heavy neither majority.
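The card does not include the harmonization or resampling code, so the following is a sketch under stated assumptions rather than the method actually used. It assumes examples are `(text, label)` pairs, that classes above the target are sampled without replacement, and that classes below it are topped up with random duplicates. `harmonize`, `balance_classes`, and the seed are hypothetical. The Davidson mapping relies on that dataset's published class IDs (0 = hate speech, 1 = offensive language, 2 = neither).

```python
import random
from collections import defaultdict

# Davidson et al. (2017) class IDs -> unified IDs used by this model.
DAVIDSON_TO_UNIFIED = {0: 2, 1: 1, 2: 0}


def harmonize(examples, mapping):
    """Rewrite (text, source_label) pairs into the unified 3-class scheme."""
    return [(text, mapping[label]) for text, label in examples]


def balance_classes(examples, per_class, seed=0):
    """Resample (text, label) pairs so every class has exactly `per_class` items."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    balanced = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) >= per_class:
            # Enough examples: draw a subset without replacement.
            balanced.extend(rng.sample(pool, per_class))
        else:
            # Too few: keep every original, then top up with random duplicates.
            balanced.extend(pool)
            balanced.extend(rng.choices(pool, k=per_class - len(pool)))
    rng.shuffle(balanced)
    return balanced
```

With `per_class=21621` and three classes, a function like this would yield the 64,863 training examples quoted above.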
Training Details
- Base model: cardiffnlp/twitter-roberta-base-2022-154m — RoBERTa trained on 154M tweets through 2022
- Epochs: 3 | LR: 2e-5 | Batch: 32 | Grad accum: 2 | Warmup: 200 steps
- Precision: bf16 | Hardware: NVIDIA A100 | Training time: ~6 minutes
- Framework: HuggingFace Transformers + Trainer API
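To put the hyperparameters in context, the sketch below works out the effective batch size and an approximate step count. It assumes "Batch: 32" is the per-device batch size, a single GPU, and the 64,863-example oversampled training set. The dictionary keys follow the corresponding `transformers.TrainingArguments` parameter names, while `schedule_summary` is a hypothetical helper. The step count Trainer reports may differ slightly because of how it rounds partial accumulation steps.

```python
import math

# Hyperparameters from the list above, keyed by TrainingArguments parameter names.
HPARAMS = {
    "num_train_epochs": 3,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 32,  # assumption: "Batch: 32" is per device
    "gradient_accumulation_steps": 2,
    "warmup_steps": 200,
    "bf16": True,
}

TRAIN_EXAMPLES = 64_863  # oversampled training set size


def schedule_summary(hparams, n_examples, n_devices=1):
    """Estimate optimizer steps and the share of training spent in warmup."""
    effective_batch = (
        hparams["per_device_train_batch_size"]
        * hparams["gradient_accumulation_steps"]
        * n_devices
    )
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * hparams["num_train_epochs"]
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": hparams["warmup_steps"] / total_steps,
    }
```

Under these assumptions the effective batch size is 64, giving roughly 1,014 optimizer steps per epoch, and the 200 warmup steps cover about 6.6% of training.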
One non-obvious implementation detail: cardiffnlp/twitter-roberta-base-2022-154m uses legacy TF-style LayerNorm parameter names (gamma/beta) that transformers ≥5.0 no longer maps automatically. The checkpoint weights were reloaded manually with gamma→weight and beta→bias renaming before training.
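The renaming described above amounts to rewriting key suffixes in the checkpoint's state dict. The original loading code is not shown in this card, so this is a minimal sketch: `rename_layernorm_keys` is a hypothetical helper, and it assumes the legacy names appear as `.gamma` and `.beta` suffixes on otherwise unchanged keys.

```python
def rename_layernorm_keys(state_dict):
    """Return a copy with `.gamma` -> `.weight` and `.beta` -> `.bias` key suffixes."""
    suffix_map = {".gamma": ".weight", ".beta": ".bias"}
    renamed = {}
    for key, value in state_dict.items():
        for old, new in suffix_map.items():
            if key.endswith(old):
                # Swap only the trailing parameter name; the module path is kept.
                key = key[: -len(old)] + new
                break
        renamed[key] = value
    return renamed
```

The renamed dictionary could then be passed to the model's `load_state_dict`. Checking the missing and unexpected keys it reports is a quick way to confirm that every LayerNorm parameter was matched.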
Limitations
- Bare numeric codes: "1488" as a standalone string is predicted neither. The model correctly classifies it in context ("1488 white power"), but the 4-digit code alone is below threshold. This is the only known failed probe case.
- Post-2022 slang: base model training data ends in 2022; emerging coded language coined after that date may not be recognized.
- Context dependence: academic discussion of hate speech ("researchers studying the n-word...") and quotation of hateful phrases to condemn them can produce false positives. The model sees tokens, not intent.
- English only: trained exclusively on English-language data. Do not use on multilingual content.
- Short bare tokens: very short inputs (1–3 tokens) are unreliable; the model needs minimal context.
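One way to act on the short-input limitation is to flag such texts before trusting the prediction. This is a rough sketch, not part of the model: `needs_more_context` and its default are hypothetical. It counts whitespace-separated words as a stand-in for model tokens, which never overstates the subword count and so errs toward flagging.

```python
def needs_more_context(text, min_tokens=4):
    """Flag inputs with fewer than `min_tokens` whitespace-separated tokens."""
    return len(text.split()) < min_tokens
```

Flagged inputs can still be classified, but their labels are better treated as low-confidence.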
Intended Use
- Content moderation pipelines needing a fast, lightweight classifier
- Research on hate speech detection and NLP fairness
- Dataset annotation assistance (human-in-the-loop review)
- Building safety filters for social media applications
Out-of-Scope Use
This model should not be used:
- As the sole arbiter for automated account suspension or content removal without human review
- To profile or surveil individuals or communities
- In contexts outside English-language social media text
- As a substitute for human moderators on high-stakes decisions
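In practice, keeping a human in the loop means the classifier output selects a review queue and does not trigger an action by itself. The sketch below illustrates that pattern. The queue names, the `route` function, and the 0.5 threshold are all hypothetical and would need tuning against real moderation data.

```python
def route(prediction, review_threshold=0.5):
    """Map one classifier output to a moderation queue, never to an automatic action."""
    label, score = prediction["label"], prediction["score"]
    if label == "neither" and score >= review_threshold:
        # Confidently clean content needs no reviewer attention.
        return "no_action"
    if label == "hate_speech":
        # Potential hate speech goes to a human first, regardless of score.
        return "priority_human_review"
    # Offensive content and low-confidence "neither" get a standard review.
    return "human_review"
```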
Citation
If you use this model in research, please cite the four underlying datasets:
Davidson et al. (2017): Automated Hate Speech Detection and the Problem of Offensive Language
ElSherief et al. (2021): ImplicitHate
Mathew et al. (2021): HateXplain
Tonneau et al. (2025): HateDay
Live Demo
https://huggingface.co/spaces/AuricErgeson/hate-speech-detector
Datasets used to train AuricErgeson/hate-speech-detector
Hate-speech-CNERG/hatexplain
tasksource/implicit-hate-stg1