	---
	language:
	- en
	license: mit
	tags:
	- text-classification
	- hate-speech-detection
	- content-moderation
	- nlp
	- roberta
	- twitter
	- safety
	- offensive-language
	- transformers
	- pytorch
	datasets:
	- tdavidson/hate_speech_offensive
	- tasksource/implicit-hate-stg1
	- Hate-speech-CNERG/hatexplain
	- manueltonneau/hateday
	metrics:
	- f1
	- accuracy
	pipeline_tag: text-classification
	model-index:
	- name: roberta-hate-speech-detector
	  results:
	  - task:
	      type: text-classification
	      name: Hate Speech Detection
	    dataset:
	      name: held-out test set (stratified 10% split)
	      type: mixed
	    metrics:
	    - type: f1
	      value: 0.8489
	      name: Weighted F1
	    - type: accuracy
	      value: 0.8434
	      name: Accuracy
	---

	# roberta-hate-speech-detector

	Twitter-native RoBERTa fine-tuned on 110K+ examples from four public hate speech datasets, augmented with targeted neo-Nazi codes and antisemitic dog whistles that general Twitter corpora miss.

	Three output classes:

	| ID | Label | Meaning |
	|----|-------|---------|
	| 0 | `neither` | Clean: not hateful or offensive |
	| 1 | `offensive` | Crude, profane, or offensive, but not hate speech |
	| 2 | `hate_speech` | Hate speech: slurs, coded language, dog whistles, white nationalist symbols |
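
If you work with raw model outputs rather than the pipeline, the ID column above is what maps a logit index to a label. The following is a minimal sketch of that conversion in plain Python; the function name `logits_to_prediction` and the example logits are made up for illustration and are not part of this repository.

```python
import math

# ID-to-label mapping taken from the table above.
ID2LABEL = {0: "neither", 1: "offensive", 2: "hate_speech"}


def logits_to_prediction(logits):
    """Convert three raw class scores into a (label, probability) pair."""
    if len(logits) != len(ID2LABEL):
        raise ValueError("expected one logit per class")
    # Numerically stable softmax: subtract the max before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the highest probability is the predicted class ID.
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits, only to show the call shape.
label, prob = logits_to_prediction([0.2, 0.1, 2.5])
print(label, round(prob, 3))
```

The pipeline shown in Quick Start performs the equivalent step internally, so this is only needed when calling the model directly.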

	## Quick Start

	```python
	from transformers import pipeline

	classifier = pipeline(
	    "text-classification",
	    model="AuricErgeson/roberta-hate-speech-detector",
	)

	classifier("they control the media")
	# [{'label': 'hate_speech', 'score': 0.87}]

	classifier("this movie sucked ass")
	# [{'label': 'offensive', 'score': 0.91}]

	classifier("I really enjoyed the concert")
	# [{'label': 'neither', 'score': 0.97}]
	```

	Batch inference:

	```python
	texts = ["heil hitler", "what a stupid film", "the weather is nice today"]
	results = classifier(texts)
	```

	## Performance

	Evaluated on a stratified held-out test set of 11,059 examples.

	| Class | Precision | Recall | F1 | Support |
	|-------|-----------|--------|----|---------|
	| neither | 0.955 | 0.822 | 0.884 | 6,514 |
	| offensive | 0.830 | 0.913 | 0.870 | 2,703 |
	| hate_speech | 0.607 | 0.817 | 0.697 | 1,842 |
	| weighted avg | 0.867 | 0.843 | 0.849 | 11,059 |

	- Validation weighted F1: 0.852 (best checkpoint, epoch 3)
	- Test accuracy: 0.843
	- Probe cases passed: 7 of 8 (see Limitations)
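
The weighted avg row in the table above is a support-weighted mean of the per-class rows. As a sanity check, the sketch below recomputes it from the published per-class figures. Because those inputs are already rounded to three decimals, the recomputed values can differ from the table in the last digit. The helper name `weighted_average` is illustrative only.

```python
# Per-class metrics copied from the table above: (precision, recall, f1, support).
per_class = {
    "neither": (0.955, 0.822, 0.884, 6514),
    "offensive": (0.830, 0.913, 0.870, 2703),
    "hate_speech": (0.607, 0.817, 0.697, 1842),
}


def weighted_average(metrics):
    """Return ((precision, recall, f1), total_support), weighted by support."""
    total_support = sum(row[3] for row in metrics.values())
    averages = tuple(
        sum(row[i] * row[3] for row in metrics.values()) / total_support
        for i in range(3)
    )
    return averages, total_support


(avg_p, avg_r, avg_f1), n = weighted_average(per_class)
print(f"precision={avg_p:.3f} recall={avg_r:.3f} f1={avg_f1:.3f} support={n}")
```

Support-weighted recall equals overall accuracy, which is why the weighted recall and the test accuracy agree.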

	## Training Data

	110,585 examples fused from four public datasets plus targeted augmentation:

	| Dataset | Examples | What it covers |
	|---------|----------|----------------|
	| [Davidson et al. 2017](https://huggingface.co/datasets/tdavidson/hate_speech_offensive) | 24,783 | Explicit Twitter slurs and offensive language |
	| [ImplicitHate](https://huggingface.co/datasets/tasksource/implicit-hate-stg1) | 21,480 | Coded language and dog whistles |
	| [HateXplain](https://huggingface.co/datasets/Hate-speech-CNERG/hatexplain) | 19,229 | Multi-annotator hate speech with rationales |
	| [HateDay 2025](https://huggingface.co/datasets/manueltonneau/hateday) (English only) | 45,000 | Large-scale contemporary Twitter hate |
	| Targeted augmentation | 93 | Neo-Nazi codes (1488, heil hitler, 14 words), antisemitic dog whistles (ZOG, "they control the media"), white nationalist phrases |

	Label harmonization: all datasets were mapped to the unified 3-class scheme. The training set was then rebalanced to 21,621 examples per class (64,863 total) to counter the heavy `neither` majority.
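
The card does not say exactly how the equal per-class counts were produced. One common approach, sketched here as an assumption rather than the procedure actually used, is to subsample classes above the target without replacement and draw duplicates for classes below it. The function name `rebalance` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def rebalance(examples, target_per_class, seed=0):
    """Resample (text, label) pairs so every label has target_per_class items."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    balanced = []
    for items in by_label.values():
        if len(items) >= target_per_class:
            # Class at or above the target: subsample without replacement.
            balanced.extend(rng.sample(items, target_per_class))
        else:
            # Class below the target: keep every example, then draw duplicates.
            extra = rng.choices(items, k=target_per_class - len(items))
            balanced.extend(items + extra)
    rng.shuffle(balanced)
    return balanced


# Toy data with a heavy class-0 majority (not the real dataset).
toy = [("a", 0)] * 10 + [("b", 1)] * 4 + [("c", 2)] * 2
counts = Counter(label for _, label in rebalance(toy, target_per_class=4))
print(counts)
```

With the real data, `target_per_class` would be 21,621. Resampling should be applied to the training split only, so the held-out test set keeps its natural class distribution.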

	## Training Details

	- Base model: [cardiffnlp/twitter-roberta-base-2022-154m](https://huggingface.co/cardiffnlp/twitter-roberta-base-2022-154m) — RoBERTa trained on 154M tweets through 2022
	- Epochs: 3 | LR: 2e-5 | Batch: 32 | Grad accum: 2 | Warmup: 200 steps
	- Precision: bf16 | Hardware: NVIDIA A100 | Training time: ~6 minutes
	- Framework: HuggingFace Transformers + Trainer API
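
To put the hyperparameters above in context, the following back-of-the-envelope sketch estimates the effective batch size and the number of optimizer steps. It assumes a single GPU, that "Batch: 32" is the per-device batch size, and that a partial final batch still counts as a step. None of this is stated in the card, so treat the step counts as approximate.

```python
import math

train_examples = 64_863  # training set size after class balancing (Training Data)
batch_size = 32          # assumed to be the per-device batch size
grad_accum = 2
epochs = 3
warmup_steps = 200

# Gradient accumulation multiplies the number of examples per optimizer step.
effective_batch = batch_size * grad_accum
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: ~{steps_per_epoch} per epoch, ~{total_steps} total")
print(f"warmup covers ~{warmup_fraction:.1%} of training")
```

Under these assumptions the 200 warmup steps cover a single-digit percentage of the run.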

	One non-obvious implementation detail: `cardiffnlp/twitter-roberta-base-2022-154m` uses legacy TF-style LayerNorm parameter names (`gamma`/`beta`) that transformers ≥5.0 no longer maps automatically. The checkpoint weights were reloaded manually with `gamma→weight` and `beta→bias` renaming before training.
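
The renaming described above amounts to rewriting key suffixes in the checkpoint's state dict. Here is a minimal sketch of that step, not the exact script used for this model. The helper name `rename_layernorm_keys` and the example keys are illustrative, and plain integers stand in for tensors so the snippet runs without PyTorch.

```python
# Legacy TF-style suffixes and their PyTorch equivalents.
SUFFIX_MAP = {".gamma": ".weight", ".beta": ".bias"}


def rename_layernorm_keys(state_dict):
    """Return a copy of state_dict with gamma/beta suffixes renamed."""
    renamed = {}
    for key, value in state_dict.items():
        new_key = key
        for old, new in SUFFIX_MAP.items():
            if key.endswith(old):
                new_key = key[: -len(old)] + new
                break
        renamed[new_key] = value
    return renamed


# Hypothetical keys; real checkpoints hold tensors as values.
legacy = {
    "encoder.layer.0.output.LayerNorm.gamma": 1,
    "encoder.layer.0.output.LayerNorm.beta": 2,
    "embeddings.word_embeddings.weight": 3,
}
fixed = rename_layernorm_keys(legacy)
print(sorted(fixed))
```

With a real checkpoint, the renamed dictionary would then be passed to the model's `load_state_dict`. It is worth checking the returned missing and unexpected keys to confirm every LayerNorm parameter was matched.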

	## Limitations

	- Bare numeric codes: `"1488"` as a standalone string is predicted `neither`. The model classifies it correctly in context ("1488 white power"), but the four-digit code alone does not trigger the `hate_speech` class. This is the only known failed probe case.
	- Post-2022 slang: the base model's training data ends in 2022, so coded language coined after that date may not be recognized.
	- Context dependence: academic discussion of hate speech ("researchers studying the n-word...") and quotation of hateful phrases to condemn them can produce false positives. The model sees tokens, not intent.
	- English only: the model was trained exclusively on English-language data. Do not use it on multilingual content.
	- Short bare tokens: very short inputs (1–3 tokens) are unreliable; the model needs at least some surrounding context.
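
One way to act on the limitations above is to send very short or low-confidence inputs to a human reviewer instead of trusting the prediction. The sketch below illustrates that routing. The function `route`, the thresholds `min_tokens` and `min_score`, and the whitespace token count (a rough proxy for the model's subword tokens) are all assumptions for illustration. `stub_classifier` stands in for the real pipeline so the example runs without downloading the model.

```python
def route(text, classify, min_tokens=4, min_score=0.80):
    """Return (decision, label, score); decision is 'auto' or 'human_review'."""
    # Whitespace split is only an approximation of the model's tokenization.
    if len(text.split()) < min_tokens:
        return "human_review", None, None
    result = classify(text)[0]
    if result["score"] < min_score:
        return "human_review", result["label"], result["score"]
    return "auto", result["label"], result["score"]


def stub_classifier(text):
    # Stand-in that mimics the pipeline's output format with a fixed answer.
    return [{"label": "neither", "score": 0.97}]


print(route("1488", stub_classifier))
print(route("I really enjoyed the concert tonight", stub_classifier))
```

In practice you would pass the `classifier` object from Quick Start in place of the stub and tune both thresholds on your own data.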

	## Intended Use

	- Content moderation pipelines needing a fast, lightweight classifier
	- Research on hate speech detection and NLP fairness
	- Dataset annotation assistance (human-in-the-loop review)
	- Building safety filters for social media applications

	## Out-of-Scope Use

	This model should not be used:
	- As the sole arbiter for automated account suspension or content removal without human review
	- To profile or surveil individuals or communities
	- In contexts outside English-language social media text
	- As a substitute for human moderators on high-stakes decisions

	## Citation

	If you use this model in research, please cite the four underlying datasets:

	```
	Davidson et al. (2017), Automated Hate Speech Detection and the Problem of Offensive Language
	ElSherief et al. (2021), Latent Hatred: A Benchmark for Understanding Implicit Hate Speech (ImplicitHate)
	Mathew et al. (2021), HateXplain
	Tonneau et al. (2025), HateDay
	```

	## Live Demo

	https://huggingface.co/spaces/AuricErgeson/hate-speech-detector