Ethics Impact Statement
Who benefits
- Users in communities: reduced exposure to harmful content.
- Moderators: reduced workload and improved triage.
- Platforms: improved trust and safety outcomes.
Who could be harmed
- Users whose content is incorrectly flagged (false positives).
- Vulnerable groups if the model exhibits identity-term bias.
Bias & fairness risks
Toxicity detectors often over-predict toxicity for text mentioning certain identities. We mitigate by:
- Using Detoxify "unbiased" baseline.
- Requiring human review for borderline cases.
- Proposing fairness slice evaluations (e.g., identity mention groups).
Explainability for stakeholders
We provide:
- top contributing label probabilities (not token-level explanations),
- clear action rationale,
- audit logs for moderation decisions (privacy-preserving).
Misuse risks
- Over-reliance on automation; mitigate with human-in-the-loop.
- Using the model to target/harass users; avoid exposing raw scores broadly.
This system is intended to assist moderation, not replace human judgment.