---
language:
- en
license: mit
tags:
- text-classification
- hate-speech-detection
- content-moderation
- nlp
- roberta
- twitter
- safety
- offensive-language
- transformers
- pytorch
datasets:
- tdavidson/hate_speech_offensive
- tasksource/implicit-hate-stg1
- Hate-speech-CNERG/hatexplain
- manueltonneau/hateday
metrics:
- f1
- accuracy
pipeline_tag: text-classification
model-index:
- name: roberta-hate-speech-detector
  results:
  - task:
      type: text-classification
      name: Hate Speech Detection
    dataset:
      name: held-out test set (stratified 10% split)
      type: mixed
    metrics:
    - type: f1
      value: 0.8489
      name: Weighted F1
    - type: accuracy
      value: 0.8434
      name: Accuracy
---

# roberta-hate-speech-detector

Twitter-native RoBERTa fine-tuned on 110K+ examples from four public hate speech datasets, augmented with targeted neo-Nazi codes and antisemitic dog whistles that general Twitter corpora miss.

Three output classes:

| ID | Label | Meaning |
|----|-------|---------|
| 0 | `neither` | Clean: not hateful or offensive |
| 1 | `offensive` | Crude, profane, or offensive, but not hate speech |
| 2 | `hate_speech` | Hate speech: slurs, coded language, dog whistles, white nationalist symbols |

## Quick Start

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="AuricErgeson/roberta-hate-speech-detector",
)

classifier("they control the media")
# [{'label': 'hate_speech', 'score': 0.87}]

classifier("this movie sucked ass")
# [{'label': 'offensive', 'score': 0.91}]

classifier("I really enjoyed the concert")
# [{'label': 'neither', 'score': 0.97}]
```

Batch inference:

```python
texts = ["heil hitler", "what a stupid film", "the weather is nice today"]
results = classifier(texts)
```

## Performance

Evaluated on a stratified held-out test set of **11,059 examples**.

| Class | Precision | Recall | F1 | Support |
|-------|-----------|--------|----|---------|
| neither | 0.955 | 0.822 | 0.884 | 6,514 |
| offensive | 0.830 | 0.913 | 0.870 | 2,703 |
| hate_speech | 0.607 | 0.817 | 0.697 | 1,842 |
| **weighted avg** | **0.867** | **0.843** | **0.849** | **11,059** |

- Val weighted F1: **0.852** (best checkpoint, epoch 3)
- Test accuracy: **0.843**
- Passed 7/8 probe cases (see Limitations)

## Training Data

110,585 examples fused from four public datasets plus targeted augmentation:

| Dataset | Examples | What it covers |
|---------|----------|----------------|
| [Davidson et al. 2017](https://huggingface.co/datasets/tdavidson/hate_speech_offensive) | 24,783 | Explicit Twitter slurs and offensive language |
| [ImplicitHate](https://huggingface.co/datasets/tasksource/implicit-hate-stg1) | 21,480 | Coded language and dog whistles |
| [HateXplain](https://huggingface.co/datasets/Hate-speech-CNERG/hatexplain) | 19,229 | Multi-annotator hate speech with rationales |
| [HateDay 2025](https://huggingface.co/datasets/manueltonneau/hateday) (English only) | 45,000 | Large-scale contemporary Twitter hate |
| Targeted augmentation | 93 | Neo-Nazi codes (1488, heil hitler, 14 words), antisemitic dog whistles (ZOG, "they control the media"), white nationalist phrases |

**Label harmonization:** all datasets mapped to the unified 3-class scheme. Training set oversampled to 21,621 examples per class (64,863 total) to counter the heavy `neither` majority.

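
The class balancing described above might look roughly like the sketch below. The card does not say how the per-class count was chosen or which sampling code was used, so the function name `balance_classes`, its parameters, and the toy data are all hypothetical; the sketch only shows the general idea of resampling every label to one fixed size.

```python
import random
from collections import defaultdict


def balance_classes(examples, per_class, seed=0):
    """Resample (text, label) pairs so every label has exactly `per_class` items.

    Hypothetical helper: not the author's training code.
    """
    rng = random.Random(seed)

    # Group the examples by their label.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    balanced = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) >= per_class:
            # Enough examples: take a random subset without replacement.
            picked = rng.sample(pool, per_class)
        else:
            # Too few: keep every original, then add duplicates drawn with replacement.
            picked = pool + rng.choices(pool, k=per_class - len(pool))
        balanced.extend(picked)

    # Shuffle so that batches mix the classes.
    rng.shuffle(balanced)
    return balanced


# Toy data standing in for the harmonized 3-class training split.
toy = [("a", 0)] * 6 + [("b", 1)] * 3 + [("c", 2)] * 2
balanced = balance_classes(toy, per_class=4)
```

With `per_class=21621`, the same routine would yield the 64,863-example training set reported above. Resampling should be applied to the training split only, so that duplicated examples never leak into validation or test data.
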
## Training Details

- **Base model:** [cardiffnlp/twitter-roberta-base-2022-154m](https://huggingface.co/cardiffnlp/twitter-roberta-base-2022-154m), a RoBERTa trained on 154M tweets through 2022
- **Epochs:** 3 | **LR:** 2e-5 | **Batch:** 32 | **Grad accum:** 2 | **Warmup:** 200 steps
- **Precision:** bf16 | **Hardware:** NVIDIA A100 | **Training time:** ~6 minutes
- **Framework:** HuggingFace Transformers + Trainer API

One non-obvious implementation detail: `cardiffnlp/twitter-roberta-base-2022-154m` uses legacy TF-style LayerNorm parameter names (`gamma`/`beta`) that transformers ≥5.0 no longer maps automatically. The checkpoint weights were reloaded manually with `gamma→weight` and `beta→bias` renaming before training.

## Limitations

- **Bare numeric codes:** `"1488"` as a standalone string is predicted `neither`. The model correctly classifies it in context ("1488 white power") but the 4-digit code alone is below threshold. This is the only known failed probe case.
- **Post-2022 slang:** base model training data ends 2022; emerging coded language coined after that date may not be recognized.
- **Context dependence:** academic discussion of hate speech ("researchers studying the n-word...") and quotation of hateful phrases to condemn them can produce false positives. The model sees tokens, not intent.
- **English only:** trained exclusively on English-language data. Do not use on multilingual content.
- **Short bare tokens:** very short inputs (1–3 tokens) are unreliable; the model needs minimal context.

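
One way to act on the short-input caveat above is to hold uncertain predictions for human review rather than act on them automatically. The sketch below is only an illustration: `needs_human_review`, `min_tokens`, and `min_score` are hypothetical names and values, not part of this model or of `transformers`, and whitespace-separated words are used as a rough stand-in for model tokens.

```python
def needs_human_review(text, prediction, min_tokens=4, min_score=0.80):
    """Return True when a prediction should not be acted on automatically.

    `prediction` is one result dict in the shape the pipeline returns,
    e.g. {"label": "neither", "score": 0.97}.
    Both thresholds are illustrative assumptions, not tuned values.
    """
    # Very short inputs (1-3 words) are listed as unreliable in this card.
    too_short = len(text.split()) < min_tokens
    # Low-confidence predictions are also escalated.
    low_confidence = prediction["score"] < min_score
    return too_short or low_confidence


# The bare numeric code from the failed probe case is escalated
# even though the (made-up) score is high.
flag_bare_code = needs_human_review("1488", {"label": "neither", "score": 0.95})

# A longer, confidently classified sentence passes through.
flag_clean = needs_human_review(
    "the weather is nice today", {"label": "neither", "score": 0.97}
)
```

A length check like this does not address the context-dependence problem (quotation or academic discussion); those cases still need a reviewer who can judge intent.
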
## Intended Use

- Content moderation pipelines needing a fast, lightweight classifier
- Research on hate speech detection and NLP fairness
- Dataset annotation assistance (human-in-the-loop review)
- Building safety filters for social media applications

## Out-of-Scope Use

This model **should not** be used:

- As the sole arbiter for automated account suspension or content removal without human review
- To profile or surveil individuals or communities
- In contexts outside English-language social media text
- As a substitute for human moderators on high-stakes decisions

## Citation

If you use this model in research, please cite the four underlying datasets:

```
Davidson et al. (2017): Automated Hate Speech Detection and the Problem of Offensive Language
ElSherief et al. (2021): ImplicitHate
Mathew et al. (2021): HateXplain
Tonneau et al. (2025): HateDay
```

## Live Demo

https://huggingface.co/spaces/AuricErgeson/hate-speech-detector