| # Ethics Impact Statement | |
| ## Who benefits | |
| - Users in communities: reduced exposure to harmful content. | |
| - Moderators: reduced workload and improved triage. | |
| - Platforms: improved trust and safety outcomes. | |
| ## Who could be harmed | |
| - Users whose content is incorrectly flagged (false positives). | |
| - Vulnerable groups if the model exhibits identity-term bias. | |
| ## Bias & fairness risks | |
| Toxicity detectors often over-predict toxicity for text that merely mentions certain identity terms, even when the text itself is benign. | |
| We mitigate by: | |
| - Using the Detoxify "unbiased" model as the baseline. | |
| - Requiring human review for borderline cases. | |
| - Proposing fairness evaluations on data slices (e.g., groups of text that mention a given identity). | |
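
One way the proposed slice evaluation could look is a per-slice false positive rate, since over-flagging benign text is the harm named above. This is a minimal sketch and not the project's evaluation code: the example keys (`score`, `label`, `slices`), the slice names, and the 0.5 threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def false_positive_rate_by_slice(examples, threshold=0.5):
    """Return {slice_name: false positive rate} for each identity slice.

    Each example is a dict with hypothetical keys:
      "score"  - model toxicity probability in [0, 1]
      "label"  - ground truth, True if the text is actually toxic
      "slices" - names of identity groups mentioned in the text
    """
    flagged = defaultdict(int)  # non-toxic examples the model flagged
    total = defaultdict(int)    # all non-toxic examples
    for ex in examples:
        if ex["label"]:
            continue  # FPR is computed over truly non-toxic text only
        # Count the example in each of its slices and in an "overall" bucket.
        for name in list(ex["slices"]) + ["overall"]:
            total[name] += 1
            if ex["score"] >= threshold:
                flagged[name] += 1
    return {name: flagged[name] / total[name] for name in total}


# Toy data, invented for illustration only.
examples = [
    {"score": 0.8, "label": False, "slices": ["group_a"]},
    {"score": 0.2, "label": False, "slices": ["group_a"]},
    {"score": 0.1, "label": False, "slices": ["group_b"]},
    {"score": 0.3, "label": False, "slices": []},
    {"score": 0.9, "label": True, "slices": ["group_b"]},
]
rates = false_positive_rate_by_slice(examples)
```

A slice whose rate sits well above the `overall` rate would be a candidate for closer human review or threshold adjustment.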
| ## Explainability for stakeholders | |
| We provide: | |
| - top contributing label probabilities (not token-level explanations), | |
| - a clear rationale for each moderation action, | |
| - audit logs for moderation decisions (privacy-preserving). | |
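
The three items above could be combined into a single audit record, sketched below under stated assumptions. The function name `build_audit_record`, the field names, and the label names in the example are hypothetical. Pseudonymizing the user ID with a keyed hash (HMAC-SHA256) and omitting the raw text is one possible reading of "privacy-preserving", not a description of the deployed system.

```python
import hashlib
import hmac
import json


def build_audit_record(user_id, label_probs, action, rationale, secret_key, top_k=3):
    """Build a log entry with no raw text and no plain user ID."""
    # A keyed hash lets auditors link records from the same user
    # without being able to recover the ID unless they hold the key.
    pseudonym = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    # Keep only the top-k label probabilities, highest first.
    top = sorted(label_probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return {
        "user": pseudonym,
        "top_labels": [{"label": name, "prob": round(p, 3)} for name, p in top],
        "action": action,
        "rationale": rationale,
    }


# Illustrative values only; the label names are assumed, not taken from the system.
record = build_audit_record(
    user_id="user-123",
    label_probs={"toxicity": 0.91, "insult": 0.74, "threat": 0.02, "obscene": 0.40},
    action="human_review",
    rationale="toxicity score above review threshold",
    secret_key=b"replace-with-a-managed-secret",
)
line = json.dumps(record)  # one JSON line per moderation decision
```

The key would need to be stored and rotated outside the log itself, otherwise the pseudonyms offer little protection.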
| ## Misuse risks | |
| - Over-reliance on automation; mitigate with human-in-the-loop review. | |
| - Use of the model to target or harass users; mitigate by not exposing raw scores broadly. | |
| This system is intended to assist moderation, not replace human judgment. | |
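
A human-in-the-loop policy of this kind can be sketched as a routing function that never removes content on its own. The thresholds (0.3 and 0.9), the function name `route`, and the outcome names are assumptions chosen for illustration; the source does not specify them.

```python
def route(score, no_action_below=0.3, priority_at=0.9):
    """Map a toxicity score in [0, 1] to a moderation queue.

    The model only triages: every flagged item ends up with a human.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    if score < no_action_below:
        return "no_action"
    if score >= priority_at:
        # High scores are surfaced first, but still decided by a moderator.
        return "priority_review"
    # Borderline cases go to the standard human review queue.
    return "human_review"


decisions = [route(s) for s in (0.1, 0.5, 0.95)]
```

Returning a queue name rather than the raw score also fits the aim of limiting who sees raw scores.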