Guardian: Dialect-Fair Hate Classifier (ModernBERT-large)

A binary hate-speech classifier fine-tuned from ModernBERT-large, explicitly debiased against African-American English (AAE) false positives.

Off-the-shelf toxicity/hate classifiers flag benign AAE text as hateful at more than twice the rate of benign General-American English (GAE) text (Sap et al. 2019). Guardian targets and measures that gap directly: at the default 0.50 threshold it reaches 0.934 recall on HateCheck while keeping the benign-AAE false-positive rate within ~0.04 of the benign-GAE rate.

Author: Guy Grigsby (Aeryx-ai). License: MIT.

Labels

0 = not hate, 1 = hate. Output the softmax probability of class 1 and threshold it. The policy is targeted hate (slur-as-attack, protected-group harassment); profanity, insults, dark themes, sexual content, and political opinion are deliberately not flagged.

Operating frontier

Evaluated on HateCheck (recall, FP) and a held-out dialect-balanced benign set (FP on high-AAE vs low-AAE text, dialect measured with TwitterAAE). Pick a threshold from the curve:

threshold  recall  HateCheck FP  high-AAE benign FP  low-AAE benign FP  dialect FP gap
0.50       0.934   0.013         0.082               0.040              0.042
0.80       0.906   0.007         0.055               0.025              0.030
0.90       0.872   0.006         0.041               0.015              0.027

The dialect FP gap stays near parity across the whole frontier (0.027 to 0.042, roughly 5x lower than the ModernBERT-large baseline's 0.21 gap). Default to 0.50 for maximum recall; raise toward 0.80 for extra parity margin (e.g. corpus filtering).
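One way to turn the frontier into a concrete choice is a small lookup over the listed operating points. The sketch below is illustrative only: the `FRONTIER` list copies the recall and dialect FP gap columns from the table, while `pick_threshold`, `max_gap`, and `min_recall` are hypothetical names, not part of the released model.

```python
# Operating points copied from the table above: (threshold, recall, dialect FP gap).
FRONTIER = [
    (0.50, 0.934, 0.042),
    (0.80, 0.906, 0.030),
    (0.90, 0.872, 0.027),
]

def pick_threshold(max_gap, min_recall=0.0):
    """Return the lowest listed threshold that satisfies both constraints.

    max_gap and min_recall are hypothetical knobs for this sketch. Returns
    None when no listed operating point qualifies.
    """
    for threshold, recall, gap in sorted(FRONTIER):
        if gap <= max_gap and recall >= min_recall:
            return threshold
    return None
```

For example, `pick_threshold(0.05)` returns 0.50 and `pick_threshold(0.03)` returns 0.80. Only the three published points are considered; thresholds between them would need to be re-measured on your own evaluation data.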

Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Aeryx-ai/guardian-dialect-fair-hate")
model = AutoModelForSequenceClassification.from_pretrained("Aeryx-ai/guardian-dialect-fair-hate").eval()

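def hate_prob(text):
    # Tokenize a single string; inputs longer than 256 tokens are truncated.
    enc = tok(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():  # inference only, no gradients needed
        # Softmax over the two classes; index 1 is the "hate" class.
        return torch.softmax(model(**enc).logits, -1)[0, 1].item()

# Flag at the default 0.50 operating point.
is_hate = hate_prob("...") >= 0.5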

How it was debiased

  1. Bias-aware training sources: Measuring Hate Speech (UC Berkeley), ToxiGen, DynaHate, Civil Comments (using identity_attack to separate group-hate from mere toxicity). The canonically AAE-biased Davidson/Founta sets are excluded as label sources.
  2. High-AAE-safe augmentation: ~14k benign high-AAE examples (mined from social text, TwitterAAE-filtered) so the model cannot use dialect as a hate cue.
  3. FP-parity gate: release was conditioned on benign high-AAE FP ≈ benign low-AAE FP on a held-out set (not just overall F1). The eval set is held out from the augmentation distribution.

Loss reweighting and DANN-style adversarial debiasing were tried; data volume of high-AAE-safe text was what actually closed the gap.
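The FP-parity gate in step 3 can be sketched as a check on benign-only scores split by dialect. This is a minimal illustration, not the author's release script: `false_positive_rate`, `passes_parity_gate`, and the `max_gap=0.05` tolerance are assumptions made for the example, and the inputs are assumed to be class-1 probabilities for benign held-out text already bucketed into high-AAE and low-AAE groups.

```python
def false_positive_rate(benign_probs, threshold):
    """Share of benign examples scored at or above the threshold."""
    if not benign_probs:
        raise ValueError("need at least one benign example")
    return sum(p >= threshold for p in benign_probs) / len(benign_probs)

def passes_parity_gate(high_aae_probs, low_aae_probs, threshold=0.5, max_gap=0.05):
    """Compare benign FP rates across dialect groups.

    max_gap is a hypothetical tolerance chosen for this sketch.
    Returns (passed, gap), where gap = high-AAE FP minus low-AAE FP.
    """
    fp_high = false_positive_rate(high_aae_probs, threshold)
    fp_low = false_positive_rate(low_aae_probs, threshold)
    gap = fp_high - fp_low
    return abs(gap) <= max_gap, gap
```

Running the check at each candidate threshold would reproduce the dialect FP gap column of the frontier table on your own held-out data.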

Limitations

  • English only. TwitterAAE is a Twitter-domain distant-supervision proxy for dialect, not ground truth, and is noisier on long-form/other-domain text.
  • Recall/parity tradeoff is explicit in the frontier: higher thresholds trade recall for margin.
  • Not for high-stakes automated decisions about individuals. Intended for content moderation triage and training-corpus filtering, with humans in the loop.
  • Trained on a "targeted hate" policy; it will not flag profanity or controversy by design.

Citation

@misc{grigsby2026guardian,
  title  = {Guardian: A Dialect-Fair Hate Classifier},
  author = {Grigsby, Guy},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/Aeryx-ai/guardian-dialect-fair-hate}}
}

References: Sap et al. 2019 (Risk of Racial Bias in Hate Speech Detection); Blodgett et al. 2016 (TwitterAAE); Kennedy et al. (Measuring Hate Speech); Hartvigsen et al. (ToxiGen); Vidgen et al. 2021 (DynaHate).

Model size: 0.4B params (Safetensors, tensor type F32).