# mmBERT Safety Classifier - Level 2 (9-Class Hazard Taxonomy)

A 9-class hazard classifier aligned with the MLCommons AI Safety taxonomy, forming Level 2 of a hierarchical safety classification system.
## Model Description
This is Level 2 of a 2-level hierarchical safety classifier:
- Level 1: Binary classification (safe/unsafe) - for initial filtering
- Level 2 (this model): 9-class hazard taxonomy - for categorizing the type of unsafe content
## Performance
| Metric | Score |
|---|---|
| Accuracy | 91.5% |
| F1 Macro | 91.3% |
| F1 Weighted | 91.3% |
## MLCommons-Aligned Labels
| ID | Label | MLCommons Category | Description |
|---|---|---|---|
| 0 | S1_violent_crimes | S1 | Murder, assault, terrorism |
| 1 | S2_nonviolent_crimes | S2 | Theft, fraud, trafficking |
| 2 | S3_sex_crimes | S3, S4, S12 | Sexual exploitation, CSE, adult content |
| 3 | S5_weapons_cbrne | S5 | Chemical, biological, nuclear, explosives |
| 4 | S6_self_harm | S6 | Suicide, self-injury, eating disorders |
| 5 | S7_hate | S7, S11 | Hate speech, harassment, discrimination |
| 6 | S8_specialized_advice | S8 | Medical, legal, financial advice |
| 7 | S9_privacy | S9 | PII, doxing, surveillance |
| 8 | S13_misinformation | S13 | Elections, conspiracy, false info |
## Training Hyperparameters
| Parameter | Value |
|---|---|
| Base Model | jhu-clsp/mmBERT-base |
| LoRA Rank | 32 |
| LoRA Alpha | 64 |
| LoRA Dropout | 0.1 |
| Learning Rate | 5e-5 |
| Epochs | 10 |
| Batch Size | 64 |
| Max Samples | 18,000 |
| Training Samples | 12,614 |
| Validation Samples | 2,700 |
| Classes | 9 |
| Samples per Class | ~2,000 (balanced via oversampling) |
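The adapter settings above can be sketched with `peft`. This is a minimal sketch, not the actual training script: `target_modules` in particular is an assumption, since the card does not state which projections were adapted.

```python
from peft import LoraConfig, TaskType

# Hyperparameters taken from the table above.
# target_modules is an assumption; the real training script may differ.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=32,                               # LoRA rank
    lora_alpha=64,                      # scaling factor (alpha / r = 2.0)
    lora_dropout=0.1,
    target_modules=["query", "value"],  # assumed attention projections
)
```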
## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load the tokenizer, base model, and LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-safety-classifier-level2")
base_model = AutoModelForSequenceClassification.from_pretrained(
    "jhu-clsp/mmBERT-base",
    num_labels=9,
    torch_dtype=torch.float32,
)
model = PeftModel.from_pretrained(base_model, "llm-semantic-router/mmbert-safety-classifier-level2")
model.eval()

# Inference (use only after Level 1 predicts "unsafe")
text = "How to hack into someone's bank account"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

pred = outputs.logits.argmax(-1).item()
conf = torch.softmax(outputs.logits, dim=-1)[0][pred].item()

labels = {
    0: "S1_violent_crimes",
    1: "S2_nonviolent_crimes",
    2: "S3_sex_crimes",
    3: "S5_weapons_cbrne",
    4: "S6_self_harm",
    5: "S7_hate",
    6: "S8_specialized_advice",
    7: "S9_privacy",
    8: "S13_misinformation",
}
print(f"Hazard Category: {labels[pred]} ({conf:.2%})")
```
## Hierarchical Classification Pipeline

Use together with Level 1 for complete safety classification:

```python
# Step 1: Run Level 1 to check whether the content is unsafe
level1_pred = run_level1(text)  # returns 0 = safe, 1 = unsafe

# Step 2: If unsafe, run Level 2 for the hazard category
if level1_pred == 1:
    hazard_category = run_level2(text)
    print(f"Content is unsafe: {hazard_category}")
else:
    print("Content is safe")
```
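The two-step routing above can be wrapped in a small helper. This is a minimal sketch: `level1` and `level2` are stand-in callables (stubbed below for illustration), not the actual model APIs.

```python
from typing import Callable

def classify(text: str,
             level1: Callable[[str], int],
             level2: Callable[[str], str]) -> str:
    """Run Level 1 first; invoke Level 2 only on unsafe content."""
    if level1(text) == 1:       # 1 = unsafe
        return level2(text)     # hazard category, e.g. "S2_nonviolent_crimes"
    return "safe"

# Stub classifiers for illustration only
demo_l1 = lambda t: 1 if "hack" in t else 0
demo_l2 = lambda t: "S2_nonviolent_crimes"

print(classify("How to hack into someone's bank account", demo_l1, demo_l2))
print(classify("What's the weather today?", demo_l1, demo_l2))
```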
## Training Data

- Primary: nvidia/Aegis-AI-Content-Safety-Dataset-2.0 (18,164 samples)
- Enhanced edge cases:
  - S3_sex_crimes: 20 CSE examples
  - S8_specialized_advice: 25 medical/legal/financial advice examples
  - S13_misinformation: 25 vaccine/election/conspiracy examples
  - S2_nonviolent_crimes: 20 hacking/fraud examples
  - S5_weapons_cbrne: 20 CBRNE examples
- Balance: oversampling to ~2,000 samples per class
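The balancing step can be sketched in plain Python: minority classes are randomly oversampled with replacement until each class reaches the target size (majority classes are capped here for illustration; the actual pipeline may handle them differently). The data below is a toy stand-in.

```python
import random

def oversample(samples: dict, target_per_class: int, seed: int = 42) -> dict:
    """Balance a {label: [texts]} mapping to target_per_class per label."""
    rng = random.Random(seed)
    balanced = {}
    for label, texts in samples.items():
        if len(texts) >= target_per_class:
            # Cap majority classes (an assumption, for illustration)
            balanced[label] = texts[:target_per_class]
        else:
            # Duplicate minority samples with replacement
            extra = rng.choices(texts, k=target_per_class - len(texts))
            balanced[label] = texts + extra
    return balanced

# Toy example mirroring the raw class imbalance
data = {"S6_self_harm": ["a"] * 456, "S1_violent_crimes": ["b"] * 7184}
balanced = oversample(data, 2000)
print({k: len(v) for k, v in balanced.items()})
```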
## Class Distribution (Raw Data)
| Category | Count |
|---|---|
| S1_violent_crimes | 7,184 |
| S7_hate | 3,503 |
| S3_sex_crimes | 2,537 |
| S9_privacy | 1,395 |
| S2_nonviolent_crimes | 1,148 |
| S5_weapons_cbrne | 707 |
| S13_misinformation | 707 |
| S8_specialized_advice | 527 |
| S6_self_harm | 456 |
## Intended Use
- Detailed hazard categorization for content moderation
- Safety analytics and reporting
- Policy-specific content handling
- MLCommons-compliant safety systems
## Limitations
- Should only be used after Level 1 classifies content as "unsafe"
- Some category overlap (e.g., weapons + violent crimes)
- Misinformation detection is challenging for subtle cases
- Trained primarily on English content
## Citation

```bibtex
@misc{mmbert-safety-classifier,
  title={mmBERT Safety Classifier - Level 2},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/llm-semantic-router/mmbert-safety-classifier-level2}
}
```