# MLCommons AI Safety Classifier - Level 2 (9-Class Hazard)
A LoRA-finetuned multilingual BERT model for 9-class hazard category classification, following the MLCommons AI Safety Hazard Taxonomy.
## Model Description
This is Level 2 of a hierarchical safety classification system:
- Level 1: Binary classification (safe vs unsafe)
- Level 2 (this model): 9-class hazard category classification
The model uses mmBERT (Multilingual ModernBERT) as the base, supporting 1800+ languages.
## Training Results
| Metric | Value |
|---|---|
| Recall (macro) | 93.5% |
| F1 Score (macro) | 93.5% |
| Accuracy | 93.5% |
| Precision (macro) | 93.4% |
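All four metrics above are macro-averaged, meaning each of the nine hazard classes contributes equally regardless of its sample count. As a quick illustration of what "macro" averaging means (the toy labels below are hypothetical, not from the evaluation set):

```python
def macro_recall(y_true, y_pred, labels):
    """Macro recall: compute recall per class, then average with
    equal weight per class (not per sample)."""
    recalls = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        support = sum(1 for t in y_true if t == c)
        recalls.append(tp / support if support else 0.0)
    return sum(recalls) / len(recalls)

# Toy example: S1 recall = 1/2, S7 recall = 2/2, S13 recall = 1/1
y_true = ["S1", "S1", "S7", "S7", "S13"]
y_pred = ["S1", "S7", "S7", "S7", "S13"]
print(macro_recall(y_true, y_pred, ["S1", "S7", "S13"]))  # 0.8333...
```

Macro averaging is the natural choice here because the training data is balanced across categories and no class should dominate the score.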
## Hazard Categories (MLCommons AI Safety Taxonomy)
| ID | Category | Description |
|---|---|---|
| S1 | Violent Crimes | Murder, assault, kidnapping, terrorism |
| S2 | Non-Violent Crimes | Fraud, theft, cybercrime, drug trafficking |
| S3 | Sex Crimes | Sexual assault, CSAM, sexual exploitation |
| S5 | Weapons & CBRNE | Weapons creation, chemical/biological/nuclear threats |
| S6 | Self-Harm | Suicide, self-injury, eating disorders |
| S7 | Hate | Discrimination, slurs, hate speech |
| S8 | Specialized Advice | Unqualified medical, legal, financial advice |
| S9 | Privacy | PII exposure, surveillance, data harvesting |
| S13 | Misinformation | Disinformation, conspiracy theories, false claims |
## Training Data
- Total samples: ~20,000 (balanced across categories)
- Sources:
  - AEGIS AI Content Safety Dataset 2.0 (~18,000 samples)
  - MLCommons AI Safety Synth (12,000 synthesized samples for weak categories)
### Synthesized Data Distribution
The synthetic dataset targets previously underrepresented categories:
- S2 (Non-Violent Crimes): 2,000 samples
- S6 (Self-Harm): 2,000 samples
- S7 (Hate): 2,000 samples
- S9 (Privacy): 2,000 samples
- S11 (Elections): 2,000 samples
- S13 (Misinformation): 2,000 samples
## Model Architecture & Training
### Base Model
- Model: jhu-clsp/mmBERT-base
- Architecture: ModernBERT (314M parameters)
### LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target Modules | attn.Wqkv, attn.Wo, mlp.Wi, mlp.Wo |
| Trainable Parameters | 6.76M (2.15%) |
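The trainable-parameter share in the table can be sanity-checked from the other reported numbers. A LoRA adapter of rank r on a weight of shape (d_out, d_in) adds r × (d_in + d_out) parameters per targeted module; the exact total depends on the layer shapes, so the snippet below only verifies that 6.76M trainable parameters against the 314M base is consistent with the stated 2.15%:

```python
# Values taken directly from the tables above
trainable = 6.76e6   # LoRA adapter parameters
base_total = 314e6   # ModernBERT base parameters

share = trainable / base_total
print(f"{share:.2%}")  # ~2.15%
```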
### Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch Size | 64 |
| Learning Rate | 3e-4 |
| Optimizer | AdamW |
| Scheduler | Linear warmup |
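The "linear warmup" schedule ramps the learning rate from 0 up to the peak (3e-4) over an initial warmup phase, then decays it linearly back to 0, as in transformers' `get_linear_schedule_with_warmup`. The warmup step count is not stated in this card, so the value below is illustrative:

```python
def linear_schedule(step, total_steps, warmup_steps, base_lr=3e-4):
    """Linear warmup to base_lr over warmup_steps, then linear decay
    to zero by total_steps (warmup_steps here is an assumed value)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Peak learning rate is reached exactly at the end of warmup
print(linear_schedule(100, 1000, 100))  # 3e-4
```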
### Hardware & Environment
| Component | Specification |
|---|---|
| GPU | AMD Instinct MI300X |
| VRAM | 192GB HBM3 |
| Platform | ROCm 6.2 |
| Container | rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0 |
| Training Time | ~3.5 minutes |
## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load the base model and the repository's tokenizer
base_model = "jhu-clsp/mmBERT-base"
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mlcommons-safety-classifier-level2-hazard")
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=9)

# Attach the LoRA adapter weights
model = PeftModel.from_pretrained(model, "llm-semantic-router/mlcommons-safety-classifier-level2-hazard")
model.eval()

# Classify
text = "How to hack into someone's email account"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()

# Label mapping (index -> hazard category)
labels = [
    "S1_violent_crimes", "S2_nonviolent_crimes", "S3_sex_crimes",
    "S5_weapons_cbrne", "S6_self_harm", "S7_hate",
    "S8_specialized_advice", "S9_privacy", "S13_misinformation",
]
print(f"Hazard Category: {labels[prediction]}")
```
## Label Mapping

```json
{
  "S1_violent_crimes": 0,
  "S2_nonviolent_crimes": 1,
  "S3_sex_crimes": 2,
  "S5_weapons_cbrne": 3,
  "S6_self_harm": 4,
  "S7_hate": 5,
  "S8_specialized_advice": 6,
  "S9_privacy": 7,
  "S13_misinformation": 8
}
```
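When wiring the mapping into code, it is often convenient to invert it into an index-to-label lookup (the `id2label` name below mirrors the convention transformers uses for its config field, but this is a plain dict, not the model's config):

```python
# Label mapping from the model card
label2id = {
    "S1_violent_crimes": 0, "S2_nonviolent_crimes": 1, "S3_sex_crimes": 2,
    "S5_weapons_cbrne": 3, "S6_self_harm": 4, "S7_hate": 5,
    "S8_specialized_advice": 6, "S9_privacy": 7, "S13_misinformation": 8,
}

# Invert to map a predicted class index back to its hazard category
id2label = {v: k for k, v in label2id.items()}
print(id2label[3])  # S5_weapons_cbrne
```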
## Hierarchical Usage (Recommended)
For production use, combine Level 1 and Level 2:

```python
# Step 1: Binary classification (Level 1)
level1_pred = level1_model(inputs)
if level1_pred == "unsafe":
    # Step 2: Hazard classification (Level 2)
    hazard_category = level2_model(inputs)
```
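A minimal sketch of the two-stage dispatch, with stub callables standing in for the two models (the `classify` helper and both stubs are hypothetical, for illustration only):

```python
def classify(text, level1, level2):
    """Two-stage pipeline: level1 and level2 are callables returning a
    label string; only inputs flagged unsafe reach the 9-class model."""
    if level1(text) == "safe":
        return "safe", None
    return "unsafe", level2(text)

# Stub classifiers standing in for the actual models
level1 = lambda text: "unsafe" if "hack" in text else "safe"
level2 = lambda text: "S2_nonviolent_crimes"

print(classify("How to hack into someone's email account", level1, level2))
```

Gating on Level 1 first keeps the cheaper binary model on the hot path; the 9-class model only runs on the (typically small) unsafe fraction of traffic.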
## Intended Use
This model is designed for:
- Detailed hazard categorization of unsafe content
- Content moderation with specific policy enforcement
- Safety analytics and reporting
- Research on content safety classification
## Limitations
- Optimized for English but supports 1800+ languages via mmBERT
- Should be used after Level 1 filtering for efficiency
- Some categories may have regional/cultural variations
- May require domain-specific fine-tuning for specialized applications
## Citation

```bibtex
@misc{mlcommons-safety-classifier,
  title={MLCommons AI Safety Classifier},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/llm-semantic-router/mlcommons-safety-classifier-level2-hazard}
}
```
## License
Apache 2.0