Guardian — Dialect-Fair Hate Classifier (ModernBERT-large)

A binary hate-speech classifier fine-tuned from ModernBERT-large, explicitly debiased against African-American English (AAE) false positives.

Off-the-shelf toxicity/hate classifiers flag benign AAE text as hateful at twice the rate or more of benign General-American English (GAE) text (Sap et al. 2019). Guardian targets and measures that gap directly: at the default 0.50 threshold it catches 93% of HateCheck hate cases while keeping the benign-AAE false-positive rate within ~0.04 of the benign-GAE rate.

Author: Guy Grigsby (Aeryx-ai). License: MIT.

Labels

0 = not hate, 1 = hate. Take the softmax probability of class 1 from the model's logits and threshold it. The policy covers targeted hate only (slur-as-attack, protected-group harassment); profanity, insults, dark themes, sexual content, and political opinion are deliberately not flagged.
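
The label convention above can be illustrated without loading the model. This is a minimal sketch in plain Python, assuming the classifier head returns two logits ordered as (not hate, hate); the function names `hate_probability` and `is_hate` are illustrative and not part of the released package.

```python
import math

def hate_probability(logits):
    """Softmax over the two class logits; returns P(class 1 = hate)."""
    not_hate, hate = logits
    # Subtract the larger logit before exponentiating for numerical stability.
    m = max(not_hate, hate)
    e0 = math.exp(not_hate - m)
    e1 = math.exp(hate - m)
    return e1 / (e0 + e1)

def is_hate(logits, threshold=0.5):
    """Apply the decision threshold to the class-1 probability."""
    return hate_probability(logits) >= threshold
```

Equal logits give a probability of exactly 0.5, which the default threshold counts as hate because the comparison is inclusive.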

Operating frontier

Evaluated on HateCheck (recall and false-positive rate, FP) and a held-out dialect-balanced benign set (FP on high-AAE vs low-AAE text, with dialect measured by TwitterAAE). Pick a threshold from the curve:

threshold	recall	HateCheck FP	high-AAE benign FP	low-AAE benign FP	dialect FP gap
0.50	0.934	0.013	0.082	0.040	0.042
0.80	0.906	0.007	0.055	0.025	0.030
0.90	0.872	0.006	0.041	0.015	0.027

The dialect FP gap stays near parity across the whole frontier: 0.027 to 0.042, versus 0.21 for the ModernBERT-large baseline, so at least ~5x smaller. Default to 0.50 for maximum recall; raise the threshold toward 0.80 when extra parity margin matters more than recall (e.g. corpus filtering).
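
Choosing an operating point can be made mechanical. The sketch below encodes the three rows of the frontier table and returns the highest-recall point that satisfies a recall floor and a dialect-gap ceiling. The names `OperatingPoint`, `FRONTIER`, and `pick_threshold` are hypothetical helpers, and the constraint values in the usage note are examples rather than recommendations from the author.

```python
from typing import NamedTuple, Optional

class OperatingPoint(NamedTuple):
    threshold: float
    recall: float
    hatecheck_fp: float
    high_aae_fp: float
    low_aae_fp: float
    dialect_gap: float

# Rows copied from the operating-frontier table.
FRONTIER = [
    OperatingPoint(0.50, 0.934, 0.013, 0.082, 0.040, 0.042),
    OperatingPoint(0.80, 0.906, 0.007, 0.055, 0.025, 0.030),
    OperatingPoint(0.90, 0.872, 0.006, 0.041, 0.015, 0.027),
]

def pick_threshold(min_recall: float, max_gap: float) -> Optional[OperatingPoint]:
    """Return the highest-recall point meeting both constraints, or None."""
    feasible = [
        p for p in FRONTIER
        if p.recall >= min_recall and p.dialect_gap <= max_gap
    ]
    return max(feasible, key=lambda p: p.recall) if feasible else None
```

For example, `pick_threshold(0.90, 0.035)` selects the 0.80 row, while a recall floor of 0.95 is unreachable with any listed threshold and returns `None`.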

Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "Aeryx-ai/guardian-dialect-fair-hate"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

def hate_prob(text):
    """Return P(class 1 = hate) for a single string."""
    # Inputs longer than 256 tokens are truncated.
    enc = tok(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():  # inference only, no gradients needed
        return torch.softmax(model(**enc).logits, -1)[0, 1].item()

# 0.5 is the default threshold; see the operating frontier for alternatives.
is_hate = hate_prob("...") >= 0.5
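
For training-corpus filtering, the same scoring function can be applied across a list of texts. The following is a rough sketch, not part of the released code: `filter_corpus` and `score_fn` are hypothetical names, and `score_fn` is assumed to be any callable that maps a string to a class-1 probability, such as `hate_prob` above. The 0.80 default follows the guidance in the operating-frontier section.

```python
def filter_corpus(texts, score_fn, threshold=0.80):
    """Split texts into kept and flagged lists of (text, score) pairs.

    Flagged items are returned rather than discarded so that a human
    reviewer can inspect them before anything is removed.
    """
    kept, flagged = [], []
    for text in texts:
        score = score_fn(text)
        target = flagged if score >= threshold else kept
        target.append((text, score))
    return kept, flagged
```

Called as `kept, flagged = filter_corpus(corpus, hate_prob)`, this scores one text at a time; batching the tokenizer and model calls would be faster on large corpora.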

How it was debiased

Bias-aware training sources — Measuring Hate Speech (UC Berkeley), ToxiGen, DynaHate, Civil Comments (using identity_attack to separate group-hate from mere toxicity). The canonically AAE-biased Davidson/Founta sets are excluded as label sources.
High-AAE-safe augmentation — ~14k benign high-AAE examples (mined from social text, TwitterAAE-filtered) so the model cannot use dialect as a hate cue.
FP-parity gate — release was conditioned on benign high-AAE FP ≈ benign low-AAE FP on a held-out set (not just overall F1). The eval set is held out from the augmentation distribution.

Loss reweighting and DANN-style adversarial debiasing were also tried, but the volume of benign high-AAE text was what actually closed the gap.
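
The FP-parity gate described above might be checked along these lines. This is a sketch under stated assumptions: the inputs are class-1 probabilities for benign held-out examples already split into high-AAE and low-AAE groups, and the `max_gap` tolerance of 0.05 is a placeholder, since the card does not state the exact tolerance used for release. The function names are illustrative.

```python
def false_positive_rate(benign_probs, threshold=0.5):
    """Fraction of benign examples scored at or above the threshold."""
    if not benign_probs:
        raise ValueError("need at least one benign example")
    return sum(p >= threshold for p in benign_probs) / len(benign_probs)

def parity_gate(high_aae_probs, low_aae_probs, threshold=0.5, max_gap=0.05):
    """Return (passed, gap), where gap = high-AAE FP minus low-AAE FP.

    max_gap is a hypothetical tolerance, not the author's release criterion.
    """
    gap = (false_positive_rate(high_aae_probs, threshold)
           - false_positive_rate(low_aae_probs, threshold))
    return abs(gap) <= max_gap, gap
```

Because every input is benign, any score at or above the threshold is a false positive, so the gate needs no labels beyond the dialect grouping.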

Limitations

English only. TwitterAAE is a Twitter-domain distant-supervision proxy for dialect, not ground truth, and is noisier on long-form/other-domain text.
Recall/parity tradeoff is explicit in the frontier — higher thresholds trade recall for margin.
Not for high-stakes automated decisions about individuals. Intended for content moderation triage and training-corpus filtering, with humans in the loop.
Trained on a "targeted hate" policy; it will not flag profanity or controversy by design.

Citation

@misc{grigsby2026guardian,
  title  = {Guardian: A Dialect-Fair Hate Classifier},
  author = {Grigsby, Guy},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/Aeryx-ai/guardian-dialect-fair-hate}}
}

References: Sap et al. 2019 (The Risk of Racial Bias in Hate Speech Detection); Blodgett et al. 2016 (TwitterAAE); Kennedy et al. 2020 (Measuring Hate Speech); Hartvigsen et al. 2022 (ToxiGen); Vidgen et al. 2021 (DynaHate).

Model size: 0.4B params (Safetensors, F32). Base model: answerdotai/ModernBERT-large.