Guardian: Dialect-Fair Hate Classifier (ModernBERT-large)
A binary hate-speech classifier fine-tuned from ModernBERT-large, explicitly debiased against African-American English (AAE) false positives.
Off-the-shelf toxicity/hate classifiers flag benign AAE text as hateful at more than twice the rate of benign General-American English (GAE) text (Sap et al. 2019). Guardian targets and measures that gap directly: at the default 0.50 threshold it reaches 93% recall on HateCheck while keeping the benign-AAE false-positive rate within ~0.04 of the benign-GAE rate.
Author: Guy Grigsby (Aeryx-ai). License: MIT.
Labels
0 = not hate, 1 = hate. Output the softmax probability of class 1 and threshold it. The policy is targeted hate (slur-as-attack, protected-group harassment); profanity, insults, dark themes, sexual content, and political opinion are deliberately not flagged.
Operating frontier
Evaluated on HateCheck (recall, FP) and a held-out dialect-balanced benign set (FP on high-AAE vs low-AAE text, dialect measured with TwitterAAE). Pick a threshold from the curve:
| threshold | recall | HateCheck FP | high-AAE benign FP | low-AAE benign FP | dialect FP gap |
|---|---|---|---|---|---|
| 0.50 | 0.934 | 0.013 | 0.082 | 0.040 | 0.042 |
| 0.80 | 0.906 | 0.007 | 0.055 | 0.025 | 0.030 |
| 0.90 | 0.872 | 0.006 | 0.041 | 0.015 | 0.027 |
The dialect FP gap (high-AAE benign FP minus low-AAE benign FP) stays near parity across the whole frontier, roughly 5x smaller than the 0.21 gap of the ModernBERT-large baseline. Default to 0.50 for maximum recall; raise toward 0.80 for extra parity margin (e.g. for corpus filtering).
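Picking an operating point from the table can be sketched as a small lookup. This is only an illustration: the `OperatingPoint` type, the `pick_threshold` helper, and its `max_gap` / `min_recall` parameters are hypothetical names introduced here, and the numbers are copied from the three rows above rather than recomputed.

```python
from typing import NamedTuple, Optional


class OperatingPoint(NamedTuple):
    threshold: float
    recall: float
    hatecheck_fp: float
    high_aae_fp: float
    low_aae_fp: float
    dialect_gap: float  # taken from the table column, not recomputed


# Rows copied verbatim from the operating-frontier table.
FRONTIER = [
    OperatingPoint(0.50, 0.934, 0.013, 0.082, 0.040, 0.042),
    OperatingPoint(0.80, 0.906, 0.007, 0.055, 0.025, 0.030),
    OperatingPoint(0.90, 0.872, 0.006, 0.041, 0.015, 0.027),
]


def pick_threshold(max_gap: float, min_recall: float = 0.0) -> Optional[OperatingPoint]:
    """Return the highest-recall tabulated point meeting both constraints, or None."""
    candidates = [
        p for p in FRONTIER
        if p.dialect_gap <= max_gap and p.recall >= min_recall
    ]
    return max(candidates, key=lambda p: p.recall) if candidates else None


if __name__ == "__main__":
    point = pick_threshold(max_gap=0.03)
    print(point.threshold if point else "no tabulated point satisfies the constraints")
```

Only the three tabulated thresholds are considered; anything in between would need to be measured on your own evaluation data.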
Usage
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Aeryx-ai/guardian-dialect-fair-hate")
model = AutoModelForSequenceClassification.from_pretrained("Aeryx-ai/guardian-dialect-fair-hate").eval()

def hate_prob(text):
    """Return the softmax probability of class 1 (hate) for a single string."""
    enc = tok(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():  # inference only, no gradients needed
        return torch.softmax(model(**enc).logits, -1)[0, 1].item()

# Threshold the probability; 0.5 is the default operating point.
is_hate = hate_prob("...") >= 0.5
How it was debiased
- Bias-aware training sources: Measuring Hate Speech (UC Berkeley), ToxiGen, DynaHate, Civil Comments (using identity_attack to separate group-hate from mere toxicity). The canonically AAE-biased Davidson/Founta sets are excluded as label sources.
- High-AAE-safe augmentation: ~14k benign high-AAE examples (mined from social text, TwitterAAE-filtered) so the model cannot use dialect as a hate cue.
- FP-parity gate: release was conditioned on benign high-AAE FP ≈ benign low-AAE FP on a held-out set (not just overall F1). The eval set is held out from the augmentation distribution.
Loss reweighting and DANN-style adversarial debiasing were also tried, but the volume of benign high-AAE training text was what actually closed the gap.
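The FP-parity check described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the release script: `false_positive_rate`, `parity_gate`, and the `max_gap=0.05` tolerance are hypothetical, since the card does not state the exact tolerance used. Both inputs are class-1 probabilities for texts already known to be benign, so every score at or above the threshold counts as a false positive.

```python
def false_positive_rate(benign_scores, threshold=0.5):
    """Share of benign texts scored at or above the threshold."""
    if not benign_scores:
        raise ValueError("need at least one benign example")
    flagged = sum(score >= threshold for score in benign_scores)
    return flagged / len(benign_scores)


def parity_gate(high_aae_scores, low_aae_scores, threshold=0.5, max_gap=0.05):
    """Compare benign FP rates across dialect groups.

    max_gap is an assumed tolerance chosen for illustration only.
    The gap is signed (high-AAE minus low-AAE), matching the table column.
    """
    high_fp = false_positive_rate(high_aae_scores, threshold)
    low_fp = false_positive_rate(low_aae_scores, threshold)
    gap = high_fp - low_fp
    return {
        "high_aae_fp": high_fp,
        "low_aae_fp": low_fp,
        "gap": gap,
        "passes": gap <= max_gap,
    }


if __name__ == "__main__":
    # Toy scores, not real model outputs.
    report = parity_gate([0.9, 0.1, 0.2, 0.3], [0.1, 0.2, 0.3, 0.4])
    print(report)
```

In practice the scores would come from running the classifier over a held-out benign set split by a dialect proxy such as TwitterAAE.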
Limitations
- English only. TwitterAAE is a Twitter-domain distant-supervision proxy for dialect, not ground truth, and is noisier on long-form/other-domain text.
- Recall/parity tradeoff is explicit in the frontier: higher thresholds trade recall for margin.
- Not for high-stakes automated decisions about individuals. Intended for content moderation triage and training-corpus filtering, with humans in the loop.
- Trained on a "targeted hate" policy; it will not flag profanity or controversy by design.
Citation
@misc{grigsby2026guardian,
title = {Guardian: A Dialect-Fair Hate Classifier},
author = {Grigsby, Guy},
year = {2026},
howpublished = {\url{https://huggingface.co/Aeryx-ai/guardian-dialect-fair-hate}}
}
References: Sap et al. 2019 (Risk of Racial Bias in Hate Speech Detection); Blodgett et al. 2016 (TwitterAAE); Kennedy et al. (Measuring Hate Speech); Hartvigsen et al. (ToxiGen); Vidgen et al. 2021 (DynaHate).
Base model: answerdotai/ModernBERT-large