roberta-hate-speech-detector

Twitter-native RoBERTa fine-tuned on 110K+ examples from four public hate speech datasets, augmented with targeted neo-Nazi codes and antisemitic dog whistles that general Twitter corpora miss.

Three output classes:

| ID | Label | Meaning |
| --- | --- | --- |
| 0 | neither | Clean: not hateful or offensive |
| 1 | offensive | Crude, profane, or offensive, but not hate speech |
| 2 | hate_speech | Hate speech: slurs, coded language, dog whistles, white nationalist symbols |
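
To make the ID-to-label mapping concrete, here is a minimal sketch of how three raw logits could be decoded into one of the classes above. It assumes the model head emits logits in ID order (0, 1, 2); the helper names `softmax` and `decode` and the sample logits are illustrative, not part of the released code.

```python
import math

# Label IDs as listed in the table above.
ID2LABEL = {0: "neither", 1: "offensive", 2: "hate_speech"}


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits, for illustration only.
label, prob = decode([-1.2, 0.3, 2.5])
print(label, round(prob, 3))
```

The `pipeline` helper in Quick Start performs this step internally, so the sketch is only useful when working with raw model outputs.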

Quick Start

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="AuricErgeson/roberta-hate-speech-detector",
)

classifier("they control the media")
# [{'label': 'hate_speech', 'score': 0.87}]

classifier("this movie sucked ass")
# [{'label': 'offensive', 'score': 0.91}]

classifier("I really enjoyed the concert")
# [{'label': 'neither', 'score': 0.97}]

Batch inference:

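texts = ["heil hitler", "what a stupid film", "the weather is nice today"]
results = classifier(texts)  # one result dict per input, in the same order
for text, result in zip(texts, results):
    print(text, result["label"], round(result["score"], 2))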

Performance

Evaluated on a stratified held-out test set of 11,059 examples.

| Class | Precision | Recall | F1 | Support |
| --- | --- | --- | --- | --- |
| neither | 0.955 | 0.822 | 0.884 | 6,514 |
| offensive | 0.830 | 0.913 | 0.870 | 2,703 |
| hate_speech | 0.607 | 0.817 | 0.697 | 1,842 |
| weighted avg | 0.867 | 0.843 | 0.849 | 11,059 |
  • Validation weighted F1: 0.852 (best checkpoint, epoch 3)
  • Test accuracy: 0.843
  • Passed 7/8 probe cases (see Limitations)
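
The relationship between the columns in the table can be checked with a few lines of arithmetic. The sketch below recomputes per-class F1 and the support-weighted averages from the rounded precision, recall, and support values reported above; because the inputs are already rounded to three decimals, results should agree with the table only to within about 0.001.

```python
# Per-class precision, recall, and support copied from the table above.
REPORT = {
    "neither": (0.955, 0.822, 6514),
    "offensive": (0.830, 0.913, 2703),
    "hate_speech": (0.607, 0.817, 1842),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total = sum(support for _, _, support in REPORT.values())
per_class_f1 = {name: f1(p, r) for name, (p, r, _) in REPORT.items()}

# Support-weighted averages; for single-label classification,
# weighted recall is the same quantity as overall accuracy.
weighted_f1 = sum(per_class_f1[n] * s for n, (_, _, s) in REPORT.items()) / total
weighted_recall = sum(r * s for _, r, s in REPORT.values()) / total

for name, value in per_class_f1.items():
    print(f"{name:12s} F1 = {value:.3f}")
print(f"weighted F1 = {weighted_f1:.3f}, weighted recall = {weighted_recall:.3f}")
```

This also explains why the weighted-average recall and the test accuracy are the same number.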

Training Data

110,585 examples fused from four public datasets plus targeted augmentation:

| Dataset | Examples | What it covers |
| --- | --- | --- |
| Davidson et al. 2017 | 24,783 | Explicit Twitter slurs and offensive language |
| ImplicitHate | 21,480 | Coded language and dog whistles |
| HateXplain | 19,229 | Multi-annotator hate speech with rationales |
| HateDay 2025 (English only) | 45,000 | Large-scale contemporary Twitter hate |
| Targeted augmentation | 93 | Neo-Nazi codes (1488, heil hitler, 14 words), antisemitic dog whistles (ZOG, "they control the media"), white nationalist phrases |

Label harmonization: all datasets were mapped to the unified 3-class scheme. The training set was then rebalanced to 21,621 examples per class (64,863 total) to counter the heavy neither majority.
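
The harmonization and class-balancing step can be sketched as follows. This is not the author's preprocessing code: the Davidson label map assumes that dataset's published encoding (0 = hate speech, 1 = offensive language, 2 = neither), the maps for the other datasets are not documented here, and `harmonize` and `rebalance` are hypothetical helpers.

```python
import random
from collections import Counter

# Unified scheme from this card.
NEITHER, OFFENSIVE, HATE = 0, 1, 2

# Assumed source labels for Davidson et al.: 0 = hate speech,
# 1 = offensive language, 2 = neither.
DAVIDSON_TO_UNIFIED = {0: HATE, 1: OFFENSIVE, 2: NEITHER}


def harmonize(rows, label_map):
    """Rewrite each (text, source_label) pair into the unified scheme."""
    return [(text, label_map[label]) for text, label in rows]


def rebalance(rows, per_class, seed=0):
    """Resample every class to exactly `per_class` rows."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    balanced = []
    for group in by_label.values():
        if len(group) >= per_class:
            balanced += rng.sample(group, per_class)  # downsample
        else:
            extra = rng.choices(group, k=per_class - len(group))
            balanced += group + extra  # keep originals, duplicate at random
    rng.shuffle(balanced)
    return balanced


# Toy data, for illustration only.
toy = [("a", 2)] * 6 + [("b", 1)] * 3 + [("c", 0)] * 1
train = rebalance(harmonize(toy, DAVIDSON_TO_UNIFIED), per_class=3)
print(Counter(label for _, label in train))
```

Balancing should be applied to the training split only, so that validation and test metrics reflect the natural class distribution.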

Training Details

  • Base model: cardiffnlp/twitter-roberta-base-2022-154m — RoBERTa trained on 154M tweets through 2022
  • Epochs: 3 | LR: 2e-5 | Batch: 32 | Grad accum: 2 | Warmup: 200 steps
  • Precision: bf16 | Hardware: NVIDIA A100 | Training time: ~6 minutes
  • Framework: HuggingFace Transformers + Trainer API
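
As a rough sanity check on the hyperparameters above, the effective batch size and the approximate number of optimizer steps can be worked out directly. The sketch assumes a single device and uses the 64,863-example balanced training set from Training Data; the Trainer's exact step count may differ slightly depending on how it rounds the final partial batch.

```python
import math

train_examples = 64_863   # balanced training set size (see Training Data)
per_device_batch = 32
grad_accum = 2
epochs = 3
warmup_steps = 200

# Assumes a single device, so the effective batch is batch * accumulation.
effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(effective_batch, steps_per_epoch, total_steps, f"{warmup_fraction:.1%}")
# prints: 64 1014 3042 6.6%
```

So the 200 warmup steps cover roughly the first 7% of training under these assumptions.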

One non-obvious implementation detail: cardiffnlp/twitter-roberta-base-2022-154m uses legacy TF-style LayerNorm parameter names (gamma/beta) that transformers ≥5.0 no longer maps automatically. The checkpoint weights were reloaded manually with gamma→weight and beta→bias renaming before training.
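
The renaming described above amounts to rewriting state-dict keys before loading. Below is a minimal sketch of that idea using a plain dictionary; the key names are hypothetical examples of the legacy pattern, the string values stand in for real tensors, and `rename_layernorm_keys` is an illustrative helper rather than the code actually used.

```python
def rename_layernorm_keys(state_dict):
    """Return a copy with LayerNorm gamma/beta renamed to weight/bias."""
    renamed = {}
    for key, value in state_dict.items():
        if key.endswith("LayerNorm.gamma"):
            key = key[: -len("gamma")] + "weight"
        elif key.endswith("LayerNorm.beta"):
            key = key[: -len("beta")] + "bias"
        renamed[key] = value
    return renamed


# Hypothetical key names; real tensors replaced by placeholders.
legacy = {
    "roberta.embeddings.LayerNorm.gamma": "tensor_a",
    "roberta.embeddings.LayerNorm.beta": "tensor_b",
    "roberta.encoder.layer.0.attention.self.query.weight": "tensor_c",
}
fixed = rename_layernorm_keys(legacy)
print(sorted(fixed))
```

In a PyTorch workflow the renamed dictionary would typically be passed to `model.load_state_dict`, and checking the returned missing and unexpected keys is a quick way to confirm that no LayerNorm parameter was left randomly initialized.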

Limitations

  • Bare numeric codes — "1488" as a standalone string is predicted neither. The model correctly classifies it in context ("1488 white power") but the 4-digit code alone is below threshold. This is the only known failed probe case.
  • Post-2022 slang — base model training data ends 2022; emerging coded language coined after that date may not be recognized.
  • Context dependence — academic discussion of hate speech ("researchers studying the n-word...") and quotation of hateful phrases to condemn them can produce false positives. The model sees tokens, not intent.
  • English only — trained exclusively on English-language data. Do not use on multilingual content.
  • Short bare tokens — very short inputs (1–3 tokens) are unreliable; the model needs minimal context.
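
One way to work around the short-input limitation is to divert very short texts before they reach the classifier. The sketch below is an assumption about how a caller might do that, not part of the model: it uses whitespace word count as a rough proxy for model tokens, and the `min_words` cutoff is illustrative rather than tuned.

```python
def needs_review(text, min_words=4):
    """Flag inputs too short to classify reliably.

    Whitespace word count is only a rough proxy for model tokens,
    and `min_words` is an illustrative cutoff, not a tuned value.
    """
    return len(text.split()) < min_words


for sample in ["1488", "what a stupid film", "the weather is nice today"]:
    print(repr(sample), "-> review" if needs_review(sample) else "-> classify")
```

Flagged inputs could be sent to a human reviewer or classified together with surrounding context, such as the parent post.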

Intended Use

  • Content moderation pipelines needing a fast, lightweight classifier
  • Research on hate speech detection and NLP fairness
  • Dataset annotation assistance (human-in-the-loop review)
  • Building safety filters for social media applications

Out-of-Scope Use

This model should not be used:

  • As the sole arbiter for automated account suspension or content removal without human review
  • To profile or surveil individuals or communities
  • In contexts outside English-language social media text
  • As a substitute for human moderators on high-stakes decisions
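
To illustrate what keeping a human in the loop might look like, here is a minimal routing sketch. It takes a prediction in the dict shape shown in Quick Start and never triggers removal or suspension on its own; the function name, the queue names, and the 0.95 threshold are all hypothetical.

```python
def route(prediction, auto_threshold=0.95):
    """Decide what happens to one pipeline-style prediction.

    `prediction` is a dict with 'label' and 'score' keys. The threshold
    and queue names are hypothetical.
    """
    label, score = prediction["label"], prediction["score"]
    if label == "neither" and score >= auto_threshold:
        return "allow"
    # Anything flagged or uncertain goes to a person; nothing is removed
    # or suspended automatically.
    if label == "hate_speech":
        return "priority_review"
    return "standard_review"


print(route({"label": "neither", "score": 0.97}))      # allow
print(route({"label": "hate_speech", "score": 0.87}))  # priority_review
print(route({"label": "offensive", "score": 0.91}))    # standard_review
```

Given the 0.607 precision on hate_speech reported in Performance, a sizeable share of flagged items will be false positives, which is the main reason the model's output should feed a review queue rather than an enforcement action.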

Citation

If you use this model in research, please cite the four underlying datasets:

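Davidson et al. (2017). Automated Hate Speech Detection and the Problem of Offensive Language.
ElSherief et al. (2021). Latent Hatred: A Benchmark for Understanding Implicit Hate Speech (ImplicitHate).
Mathew et al. (2021). HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection.
Tonneau et al. (2025). HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter.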

Live Demo

https://huggingface.co/spaces/AuricErgeson/hate-speech-detector


Evaluation results

  • Weighted F1 on held-out test set (stratified 10% split, self-reported): 0.849
  • Accuracy on held-out test set (stratified 10% split, self-reported): 0.843