MLCommons AI Safety Classifier - Level 2 (9-Class Hazard)

A LoRA-fine-tuned mmBERT (Multilingual ModernBERT) model for 9-class hazard category classification, following the MLCommons AI Safety Hazard Taxonomy.

Model Description

This is Level 2 of a hierarchical safety classification system:

  • Level 1: Binary classification (safe vs unsafe)
  • Level 2 (this model): 9-class hazard category classification

The model uses mmBERT (Multilingual ModernBERT) as the base, supporting 1800+ languages.

Training Results

| Metric | Value |
|---|---|
| Recall (macro) | 93.5% |
| F1 Score (macro) | 93.5% |
| Accuracy | 93.5% |
| Precision (macro) | 93.4% |

Hazard Categories (MLCommons AI Safety Taxonomy)

| ID | Category | Description |
|---|---|---|
| S1 | Violent Crimes | Murder, assault, kidnapping, terrorism |
| S2 | Non-Violent Crimes | Fraud, theft, cybercrime, drug trafficking |
| S3 | Sex Crimes | Sexual assault, CSAM, sexual exploitation |
| S5 | Weapons & CBRNE | Weapons creation, chemical/biological/nuclear threats |
| S6 | Self-Harm | Suicide, self-injury, eating disorders |
| S7 | Hate | Discrimination, slurs, hate speech |
| S8 | Specialized Advice | Unqualified medical, legal, financial advice |
| S9 | Privacy | PII exposure, surveillance, data harvesting |
| S13 | Misinformation | Disinformation, conspiracy theories, false claims |

Training Data

Synthesized Data Distribution

The synthetic dataset targets previously underrepresented categories:

  • S2 (Non-Violent Crimes): 2,000 samples
  • S6 (Self-Harm): 2,000 samples
  • S7 (Hate): 2,000 samples
  • S9 (Privacy): 2,000 samples
  • S11 (Elections): 2,000 samples
  • S13 (Misinformation): 2,000 samples
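
As a rough illustration only (the file and column names below are assumptions, not the project's actual data pipeline), drawing a balanced 2,000-sample subset per targeted category could look like this:

import pandas as pd

# Hypothetical input: synthetic prompts with "text" and "label" columns
# (file and column names are assumptions for illustration).
df = pd.read_csv("synthetic_hazard_data.csv")

targeted = ["S2", "S6", "S7", "S9", "S11", "S13"]
subset = df[df["label"].isin(targeted)]

# 2,000 examples per targeted category; sample with replacement if a category falls short.
balanced = (
    subset.groupby("label", group_keys=False)
          .apply(lambda g: g.sample(n=2000, replace=len(g) < 2000, random_state=42))
)
print(balanced["label"].value_counts())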

Model Architecture & Training

Base Model

jhu-clsp/mmBERT-base (Multilingual ModernBERT)

LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target Modules | attn.Wqkv, attn.Wo, mlp.Wi, mlp.Wo |
| Trainable Parameters | 6.76M (2.15%) |
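
A peft configuration matching the table above would look roughly like this (the sequence-classification task type and base-model loading are assumptions based on the rest of this card):

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Base encoder with a 9-way classification head
base = AutoModelForSequenceClassification.from_pretrained("jhu-clsp/mmBERT-base", num_labels=9)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["attn.Wqkv", "attn.Wo", "mlp.Wi", "mlp.Wo"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # expected: roughly 6.76M trainable (~2.15%)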

Training Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch Size | 64 |
| Learning Rate | 3e-4 |
| Optimizer | AdamW |
| Scheduler | Linear warmup |
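
In Transformers terms, these settings map onto a TrainingArguments object roughly as follows (the output directory and warmup ratio are assumptions; the card only states a linear schedule with warmup):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="mlcommons-level2-hazard",  # assumption: any local path works
    num_train_epochs=10,
    per_device_train_batch_size=64,
    learning_rate=3e-4,
    optim="adamw_torch",
    lr_scheduler_type="linear",
    warmup_ratio=0.1,  # assumption: the warmup fraction is not given in the card
)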

Hardware & Environment

| Component | Specification |
|---|---|
| GPU | AMD Instinct MI300X |
| VRAM | 192GB HBM3 |
| Platform | ROCm 6.2 |
| Container | rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0 |
| Training Time | ~3.5 minutes |

Usage

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load base model and tokenizer, then apply the LoRA adapter
base_model = "jhu-clsp/mmBERT-base"
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mlcommons-safety-classifier-level2-hazard")
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=9)
model = PeftModel.from_pretrained(model, "llm-semantic-router/mlcommons-safety-classifier-level2-hazard")
model.eval()

# Classify
text = "How to hack into someone's email account"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()

# Label mapping (index order matches the mapping below)
labels = [
    "S1_violent_crimes", "S2_nonviolent_crimes", "S3_sex_crimes",
    "S5_weapons_cbrne", "S6_self_harm", "S7_hate",
    "S8_specialized_advice", "S9_privacy", "S13_misinformation"
]
print(f"Hazard Category: {labels[prediction]}")

Label Mapping

{
  "S1_violent_crimes": 0,
  "S2_nonviolent_crimes": 1,
  "S3_sex_crimes": 2,
  "S5_weapons_cbrne": 3,
  "S6_self_harm": 4,
  "S7_hate": 5,
  "S8_specialized_advice": 6,
  "S9_privacy": 7,
  "S13_misinformation": 8
}
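
The same mapping can also be wired into the model config so that predictions resolve to readable names without a hand-maintained list; a minimal sketch (mirroring the JSON above and the Usage example):

label2id = {
    "S1_violent_crimes": 0, "S2_nonviolent_crimes": 1, "S3_sex_crimes": 2,
    "S5_weapons_cbrne": 3, "S6_self_harm": 4, "S7_hate": 5,
    "S8_specialized_advice": 6, "S9_privacy": 7, "S13_misinformation": 8,
}
id2label = {v: k for k, v in label2id.items()}

model = AutoModelForSequenceClassification.from_pretrained(
    base_model, num_labels=9, id2label=id2label, label2id=label2id
)
# wrap with PeftModel as in the Usage example, then:
# model.config.id2label[prediction] gives the category name directly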

Hierarchical Usage (Recommended)

For production use, combine Level 1 and Level 2:

# Step 1: Binary classification (Level 1)
level1_pred = level1_model(inputs)
if level1_pred == "unsafe":
    # Step 2: Hazard classification (Level 2)
    hazard_category = level2_model(inputs)
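
A more concrete sketch of this two-stage routing is below. The Level 1 repository id and its label order (index 0 = safe, index 1 = unsafe) are assumptions for illustration; substitute the actual Level 1 adapter and check its card:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

BASE = "jhu-clsp/mmBERT-base"
LEVEL1 = "llm-semantic-router/mlcommons-safety-classifier-level1-binary"  # assumed repo id
LEVEL2 = "llm-semantic-router/mlcommons-safety-classifier-level2-hazard"

tokenizer = AutoTokenizer.from_pretrained(LEVEL2)
level1 = PeftModel.from_pretrained(
    AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=2), LEVEL1
).eval()
level2 = PeftModel.from_pretrained(
    AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=9), LEVEL2
).eval()

HAZARDS = [
    "S1_violent_crimes", "S2_nonviolent_crimes", "S3_sex_crimes",
    "S5_weapons_cbrne", "S6_self_harm", "S7_hate",
    "S8_specialized_advice", "S9_privacy", "S13_misinformation",
]

def classify(text: str):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        # Step 1: binary safe/unsafe (assumes label order [safe, unsafe])
        if level1(**inputs).logits.argmax(-1).item() == 0:
            return "safe", None
        # Step 2: hazard category, run only for unsafe inputs
        hazard = HAZARDS[level2(**inputs).logits.argmax(-1).item()]
    return "unsafe", hazard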

Intended Use

This model is designed for:

  • Detailed hazard categorization of unsafe content
  • Content moderation with specific policy enforcement
  • Safety analytics and reporting
  • Research on content safety classification

Limitations

  • Optimized for English but supports 1800+ languages via mmBERT
  • Should be used after Level 1 filtering for efficiency
  • Some categories may have regional/cultural variations
  • May require domain-specific fine-tuning for specialized applications

Citation

@misc{mlcommons-safety-classifier,
  title={MLCommons AI Safety Classifier},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/llm-semantic-router/mlcommons-safety-classifier-level2-hazard}
}

License

Apache 2.0
