roberta-hate-speech-detector

Twitter-native RoBERTa fine-tuned on 110K+ examples from four public hate speech datasets, augmented with targeted neo-Nazi codes and antisemitic dog whistles that general Twitter corpora miss.

Three output classes:

ID	Label	Meaning
0	`neither`	Clean — not hateful or offensive
1	`offensive`	Crude, profane, or offensive — but not hate speech
2	`hate_speech`	Hate speech — slurs, coded language, dog whistles, white nationalist symbols

Quick Start

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="AuricErgeson/roberta-hate-speech-detector",
)

classifier("they control the media")
# [{'label': 'hate_speech', 'score': 0.87}]

classifier("this movie sucked ass")
# [{'label': 'offensive', 'score': 0.91}]

classifier("I really enjoyed the concert")
# [{'label': 'neither', 'score': 0.97}]

Batch inference:

texts = ["heil hitler", "what a stupid film", "the weather is nice today"]
results = classifier(texts)  # one {'label': ..., 'score': ...} dict per input, in input order
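
When the full score distribution is wanted rather than only the top label, the pipeline output can be reshaped with a small helper. This is a minimal sketch, assuming the installed transformers version accepts `top_k=None` on a text-classification call and returns one label/score dict per class. `scores_by_label` and `top_label` are hypothetical helpers, not part of the model repository, and the scores in `sample` are invented for illustration only.

```python
def scores_by_label(all_scores):
    """Turn [{'label': ..., 'score': ...}, ...] into {label: score}."""
    return {item["label"]: item["score"] for item in all_scores}


def top_label(all_scores):
    """Return the (label, score) pair with the highest score."""
    best = max(all_scores, key=lambda item: item["score"])
    return best["label"], best["score"]


# Assumed usage with the classifier from Quick Start (needs the model download):
#   all_scores = classifier("what a stupid film", top_k=None)
# Offline stand-in with invented scores, so the helpers can be tried directly:
sample = [
    {"label": "offensive", "score": 0.80},
    {"label": "neither", "score": 0.15},
    {"label": "hate_speech", "score": 0.05},
]
print(scores_by_label(sample))
print(top_label(sample))
```

Keeping all three scores makes it possible to inspect borderline cases, for example an input where `offensive` and `hate_speech` score close together.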

Performance

Evaluated on a stratified held-out test set of 11,059 examples.

Class	Precision	Recall	F1	Support
neither	0.955	0.822	0.884	6,514
offensive	0.830	0.913	0.870	2,703
hate_speech	0.607	0.817	0.697	1,842
weighted avg	0.867	0.843	0.849	11,059

Validation weighted F1: 0.852 (best checkpoint, epoch 3)
Test accuracy: 0.843
Probe cases passed: 7 of 8 (see Limitations)
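
The table can be sanity-checked from its own precision, recall, and support columns. The sketch below recomputes per-class F1 and the support-weighted averages. Because the inputs are the rounded table values, the results should match the reported figures only to within about 0.001. In single-label classification the support-weighted recall equals accuracy, which is why it lines up with the test accuracy.

```python
# Precision, recall, and support copied from the Performance table.
rows = {
    "neither": (0.955, 0.822, 6514),
    "offensive": (0.830, 0.913, 2703),
    "hate_speech": (0.607, 0.817, 1842),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total = sum(support for _, _, support in rows.values())
per_class_f1 = {name: f1(p, r) for name, (p, r, _) in rows.items()}

# Support-weighted averages, as in the "weighted avg" row.
weighted_f1 = sum(per_class_f1[name] * s for name, (_, _, s) in rows.items()) / total
weighted_recall = sum(r * s for _, r, s in rows.values()) / total

for name, value in per_class_f1.items():
    print(f"{name}: F1 = {value:.3f}")
print(f"weighted F1 = {weighted_f1:.3f}, weighted recall = {weighted_recall:.3f}")
```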

Training Data

110,585 examples fused from four public datasets plus targeted augmentation:

Dataset	Examples	What it covers
Davidson et al. 2017	24,783	Explicit Twitter slurs and offensive language
ImplicitHate	21,480	Coded language and dog whistles
HateXplain	19,229	Multi-annotator hate speech with rationales
HateDay 2025 (English only)	45,000	Large-scale contemporary Twitter hate
Targeted augmentation	93	Neo-Nazi codes (1488, heil hitler, 14 words), antisemitic dog whistles (ZOG, "they control the media"), white nationalist phrases

Label harmonization: all datasets were mapped to the unified 3-class scheme. The training set was then resampled to 21,621 examples per class (64,863 total) to counter the heavy neither majority.
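
The card does not spell out the per-dataset label mappings or the resampling procedure, so the following is only a sketch of one way to do both steps. The mapping shown assumes a source dataset that encodes 0 = hate speech, 1 = offensive, 2 = neither (the convention usually associated with the Davidson et al. release); treat that as an assumption to check against the actual files. `harmonize` and `resample_per_class` are hypothetical names, and the toy data is invented.

```python
import random
from collections import Counter, defaultdict

# Unified scheme from the label table: 0 = neither, 1 = offensive, 2 = hate_speech.
# ASSUMPTION: the source dataset uses 0 = hate speech, 1 = offensive, 2 = neither.
SOURCE_TO_UNIFIED = {0: 2, 1: 1, 2: 0}


def harmonize(examples, mapping):
    """Rewrite each (text, source_label) pair into the unified label ids."""
    return [(text, mapping[label]) for text, label in examples]


def resample_per_class(examples, target, seed=0):
    """Return exactly `target` examples per class.

    Classes larger than `target` are subsampled without replacement;
    smaller classes are oversampled with replacement.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    balanced = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) >= target:
            balanced.extend(rng.sample(pool, target))
        else:
            balanced.extend(pool + rng.choices(pool, k=target - len(pool)))
    rng.shuffle(balanced)
    return balanced


# Toy data in the assumed source encoding: four "neither", two "offensive", one "hate".
toy = [("a", 2), ("b", 2), ("c", 2), ("d", 2), ("e", 1), ("f", 1), ("g", 0)]
unified = harmonize(toy, SOURCE_TO_UNIFIED)
balanced = resample_per_class(unified, target=2)
print(Counter(label for _, label in balanced))
```

Resampling should be applied to the training split only, so that validation and test sets keep the natural class distribution reported in the Performance table.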

Training Details

Base model: cardiffnlp/twitter-roberta-base-2022-154m — RoBERTa trained on 154M tweets through 2022
Epochs: 3 | LR: 2e-5 | Batch: 32 | Grad accum: 2 | Warmup: 200 steps
Precision: bf16 | Hardware: NVIDIA A100 | Training time: ~6 minutes
Framework: HuggingFace Transformers + Trainer API
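
As a rough check on the schedule, the hyperparameters above imply an effective batch size and step count that can be worked out directly. This sketch assumes "Batch: 32" is the per-device batch size, that training ran on a single GPU, and that the 64,863-example resampled training set was used. The exact step count reported by a trainer may differ by a step or so depending on how the last partial accumulation is handled.

```python
import math

train_examples = 64_863   # resampled training set size from Training Data
per_device_batch = 32     # "Batch: 32" (assumed to be per device)
grad_accum = 2            # gradient accumulation steps
num_devices = 1           # ASSUMPTION: one A100
epochs = 3
warmup_steps = 200

# Examples consumed per optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup share of training: {warmup_fraction:.1%}")
```

Under these assumptions the 200 warmup steps cover only a small share of training, so most updates run on the decaying part of the schedule.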

One non-obvious implementation detail: cardiffnlp/twitter-roberta-base-2022-154m uses legacy TF-style LayerNorm parameter names (gamma/beta) that transformers ≥5.0 no longer maps automatically. The checkpoint weights were reloaded manually with gamma→weight and beta→bias renaming before training.
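
The renaming step can be sketched on a plain dictionary of parameter names. This is an illustration of the idea, not the author's actual script: `rename_layernorm_keys` is a hypothetical helper, the example keys are made up to resemble a RoBERTa checkpoint, and matching on the `.gamma` / `.beta` suffix is an assumption about how the legacy names appear.

```python
def rename_layernorm_keys(state_dict):
    """Return a new dict with legacy '.gamma'/'.beta' suffixes renamed.

    '.gamma' becomes '.weight' and '.beta' becomes '.bias'; every other
    key is copied unchanged. Values are not modified.
    """
    renamed = {}
    for key, value in state_dict.items():
        if key.endswith(".gamma"):
            key = key[: -len("gamma")] + "weight"
        elif key.endswith(".beta"):
            key = key[: -len("beta")] + "bias"
        renamed[key] = value
    return renamed


# Illustrative keys only; the values would be tensors in a real checkpoint.
legacy = {
    "roberta.embeddings.LayerNorm.gamma": "g0",
    "roberta.embeddings.LayerNorm.beta": "b0",
    "roberta.encoder.layer.0.output.dense.weight": "w0",
}
fixed = rename_layernorm_keys(legacy)
print(sorted(fixed))

# Assumed follow-up (not run here): load the renamed dict into the model, e.g.
#   model.load_state_dict(fixed, strict=False)
# and inspect the reported missing/unexpected keys before training.
```

Checking the missing and unexpected keys after loading is the quickest way to confirm that no LayerNorm parameter was silently left at its random initialization.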

Limitations

Bare numeric codes: "1488" as a standalone string is predicted as neither. The model classifies it correctly in context ("1488 white power"), but the four-digit code alone does not score high enough to be labeled hate_speech. This is the only known failed probe case.
Post-2022 slang: the base model's training data ends in 2022, so coded language coined after that date may not be recognized.
Context dependence: academic discussion of hate speech ("researchers studying the n-word...") and quotation of hateful phrases to condemn them can produce false positives. The model sees tokens, not intent.
English only: trained exclusively on English-language data. Do not use on multilingual content.
Short bare tokens: very short inputs (1–3 tokens) are unreliable; the model needs at least a little surrounding context.
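
One way to act on the short-input limitation is to flag very short texts before trusting the prediction. The sketch below counts whitespace-separated words as a stand-in for model tokens, which is an assumption (the model itself uses subword tokens). `is_low_context` and the default cutoff are hypothetical and chosen only to mirror the "1–3 tokens" guidance.

```python
def is_low_context(text, min_tokens=4):
    """Return True when the text has fewer than `min_tokens` whitespace tokens.

    Whitespace splitting is a rough proxy for the model's subword tokens.
    """
    return len(text.split()) < min_tokens


for text in ["1488", "what a stupid film", "the weather is nice today"]:
    flag = "low context, treat prediction with caution" if is_low_context(text) else "ok"
    print(f"{text!r}: {flag}")
```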

Intended Use

Content moderation pipelines needing a fast, lightweight classifier
Research on hate speech detection and NLP fairness
Dataset annotation assistance (human-in-the-loop review)
Building safety filters for social media applications

Out-of-Scope Use

This model should not be used:

As the sole arbiter for automated account suspension or content removal without human review
To profile or surveil individuals or communities
In contexts outside English-language social media text
As a substitute for human moderators on high-stakes decisions
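
In a moderation pipeline, the guidance above can be reflected by never letting the classifier take an enforcement action on its own. The following is a minimal sketch under stated assumptions: `route` is a hypothetical helper, the 0.90 threshold is arbitrary and would need tuning on real data, and the input is one prediction dict in the shape shown in Quick Start.

```python
def route(prediction, auto_allow_threshold=0.90):
    """Decide what happens to one classified item.

    Only confident 'neither' predictions pass automatically; anything
    flagged as offensive or hate_speech, and any low-confidence result,
    is sent to a human reviewer instead of being actioned by the model.
    """
    label = prediction["label"]
    score = prediction["score"]
    if label == "neither" and score >= auto_allow_threshold:
        return "allow"
    return "human_review"


examples = [
    {"label": "neither", "score": 0.97},
    {"label": "neither", "score": 0.60},
    {"label": "hate_speech", "score": 0.87},
]
for prediction in examples:
    print(prediction["label"], prediction["score"], "->", route(prediction))
```

The design choice here is that the model only narrows the review queue; removal or suspension decisions stay with people.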

Citation

If you use this model in research, please cite the four underlying datasets:

Davidson et al. (2017): Automated Hate Speech Detection and the Problem of Offensive Language
ElSherief et al. (2021): ImplicitHate
Mathew et al. (2021): HateXplain
Tonneau et al. (2025): HateDay

Live Demo

https://huggingface.co/spaces/AuricErgeson/hate-speech-detector

Model weights: Safetensors | Model size: 0.1B params | Tensor type: F32

Evaluation results (self-reported)

Weighted F1 on held-out test set (stratified 10% split): 0.849
Accuracy on held-out test set (stratified 10% split): 0.843