mmBERT Safety Classifier - Level 2 (9-Class Hazard Taxonomy)

A 9-class hazard classifier aligned with the MLCommons AI Safety taxonomy. Part of a hierarchical safety classification system.

Model Description

This is Level 2 of a 2-level hierarchical safety classifier:

  • Level 1: Binary classification (safe/unsafe) - for initial filtering
  • Level 2 (this model): 9-class hazard taxonomy - for categorizing the type of unsafe content

Performance

| Metric      | Score |
|-------------|-------|
| Accuracy    | 91.5% |
| F1 Macro    | 91.3% |
| F1 Weighted | 91.3% |

MLCommons-Aligned Labels

| ID | Label                 | MLCommons Category | Description                                |
|----|-----------------------|--------------------|--------------------------------------------|
| 0  | S1_violent_crimes     | S1                 | Murder, assault, terrorism                 |
| 1  | S2_nonviolent_crimes  | S2                 | Theft, fraud, trafficking                  |
| 2  | S3_sex_crimes         | S3, S4, S12        | Sexual exploitation, CSE, adult content    |
| 3  | S5_weapons_cbrne      | S5                 | Chemical, biological, nuclear, explosives  |
| 4  | S6_self_harm          | S6                 | Suicide, self-injury, eating disorders     |
| 5  | S7_hate               | S7, S11            | Hate speech, harassment, discrimination    |
| 6  | S8_specialized_advice | S8                 | Medical, legal, financial advice           |
| 7  | S9_privacy            | S9                 | PII, doxing, surveillance                  |
| 8  | S13_misinformation    | S13                | Elections, conspiracy, false info          |
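The table above collapses several MLCommons S-codes into single classes (S3/S4/S12 into one, S7/S11 into another). A minimal sketch of that mapping as a lookup table (the helper name `label_for_code` is illustrative, not part of the model's API):

```python
# Map raw MLCommons hazard codes to this model's 9 class IDs.
# Several codes are merged into one class (e.g. S3/S4/S12 -> class 2).
MLCOMMONS_TO_CLASS = {
    "S1": 0, "S2": 1,
    "S3": 2, "S4": 2, "S12": 2,
    "S5": 3, "S6": 4,
    "S7": 5, "S11": 5,
    "S8": 6, "S9": 7, "S13": 8,
}

CLASS_LABELS = [
    "S1_violent_crimes", "S2_nonviolent_crimes", "S3_sex_crimes",
    "S5_weapons_cbrne", "S6_self_harm", "S7_hate",
    "S8_specialized_advice", "S9_privacy", "S13_misinformation",
]

def label_for_code(code: str) -> str:
    """Return this model's class label for a raw MLCommons hazard code."""
    return CLASS_LABELS[MLCOMMONS_TO_CLASS[code]]
```

For example, `label_for_code("S4")` resolves to `S3_sex_crimes`, since child sexual exploitation (S4) is folded into the sex-crimes class.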

Training Hyperparameters

| Parameter          | Value                              |
|--------------------|------------------------------------|
| Base Model         | jhu-clsp/mmBERT-base               |
| LoRA Rank          | 32                                 |
| LoRA Alpha         | 64                                 |
| LoRA Dropout       | 0.1                                |
| Learning Rate      | 5e-5                               |
| Epochs             | 10                                 |
| Batch Size         | 64                                 |
| Max Samples        | 18,000                             |
| Training Samples   | 12,614                             |
| Validation Samples | 2,700                              |
| Classes            | 9                                  |
| Samples per Class  | ~2,000 (balanced via oversampling) |
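A quick back-of-the-envelope check on training length from the numbers above: 12,614 training samples at batch size 64 works out to 198 optimizer steps per epoch, or 1,980 steps over 10 epochs (assuming the last partial batch is kept):

```python
import math

train_samples = 12_614
batch_size = 64
epochs = 10

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 198 1980
```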

Usage

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load model
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-safety-classifier-level2")
base_model = AutoModelForSequenceClassification.from_pretrained(
    "jhu-clsp/mmBERT-base", 
    num_labels=9,
    torch_dtype=torch.float32
)
model = PeftModel.from_pretrained(base_model, "llm-semantic-router/mmbert-safety-classifier-level2")
model.eval()

# Inference (use only after Level 1 predicts "unsafe")
text = "How to hack into someone's bank account"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    pred = outputs.logits.argmax(-1).item()
    conf = torch.softmax(outputs.logits, dim=-1)[0][pred].item()

labels = {
    0: "S1_violent_crimes",
    1: "S2_nonviolent_crimes", 
    2: "S3_sex_crimes",
    3: "S5_weapons_cbrne",
    4: "S6_self_harm",
    5: "S7_hate",
    6: "S8_specialized_advice",
    7: "S9_privacy",
    8: "S13_misinformation"
}
print(f"Hazard Category: {labels[pred]} ({conf:.2%})")

Hierarchical Classification Pipeline

Use with Level 1 for complete safety classification:

# Step 1: Run Level 1 to check if content is unsafe
level1_pred = run_level1(text)  # Returns 0=safe, 1=unsafe

# Step 2: If unsafe, run Level 2 for hazard category
if level1_pred == 1:
    hazard_category = run_level2(text)
    print(f"Content is unsafe: {hazard_category}")
else:
    print("Content is safe")
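The two-step logic above can be factored into a small helper that takes the two classifiers as callables. The `run_level1`/`run_level2` names come from the snippet; the stubs below stand in for real model calls purely for illustration:

```python
from typing import Callable, Optional

def classify(text: str,
             level1: Callable[[str], int],
             level2: Callable[[str], str]) -> Optional[str]:
    """Return the hazard category if unsafe, or None if safe.

    Level 2 is only invoked after Level 1 flags the text as unsafe,
    matching the intended hierarchical use of this model.
    """
    if level1(text) == 1:  # 0 = safe, 1 = unsafe
        return level2(text)
    return None

# Keyword stubs for illustration only -- not real model inference.
stub_level1 = lambda t: 1 if "hack" in t.lower() else 0
stub_level2 = lambda t: "S2_nonviolent_crimes"

print(classify("How to hack a bank account", stub_level1, stub_level2))
# -> S2_nonviolent_crimes
print(classify("What is the weather today?", stub_level1, stub_level2))
# -> None
```

Keeping the classifiers as parameters makes the pipeline easy to unit-test before wiring in the actual Level 1 and Level 2 models.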

Training Data

  • Primary: nvidia/Aegis-AI-Content-Safety-Dataset-2.0 (18,164 samples)
  • Enhanced Edge Cases:
    • S3_sex_crimes: 20 CSE examples
    • S8_specialized_advice: 25 medical/legal/financial advice examples
    • S13_misinformation: 25 vaccine/election/conspiracy examples
    • S2_nonviolent_crimes: 20 hacking/fraud examples
    • S5_weapons_cbrne: 20 CBRNE examples
  • Balance: Oversampling to ~2,000 samples per class

Class Distribution (Raw Data)

| Category              | Count |
|-----------------------|-------|
| S1_violent_crimes     | 7,184 |
| S7_hate               | 3,503 |
| S3_sex_crimes         | 2,537 |
| S9_privacy            | 1,395 |
| S2_nonviolent_crimes  | 1,148 |
| S5_weapons_cbrne      | 707   |
| S13_misinformation    | 707   |
| S8_specialized_advice | 527   |
| S6_self_harm          | 456   |
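The raw distribution is heavily skewed (S1 has nearly 16× the samples of S6), which is why training balances classes to ~2,000 each. A minimal sketch of that balancing step, assuming a list of `(text, label)` pairs and assuming majority classes are capped at the target (the exact training recipe may differ):

```python
import random

def balance(examples, target_per_class=2000, seed=0):
    """Oversample minority classes (with replacement) and cap majority
    classes so every class ends up with target_per_class examples."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in examples:
        by_class.setdefault(label, []).append((text, label))

    balanced = []
    for label, items in by_class.items():
        if len(items) >= target_per_class:
            balanced.extend(items[:target_per_class])  # cap majority class
        else:
            balanced.extend(items)
            extra = target_per_class - len(items)
            balanced.extend(rng.choices(items, k=extra))  # duplicate minority examples
    rng.shuffle(balanced)
    return balanced
```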

Intended Use

  • Detailed hazard categorization for content moderation
  • Safety analytics and reporting
  • Policy-specific content handling
  • MLCommons-compliant safety systems

Limitations

  • Should only be used after Level 1 classifies content as "unsafe"
  • Some category overlap (e.g., weapons + violent crimes)
  • Misinformation detection is challenging for subtle cases
  • Trained primarily on English content

Citation

@misc{mmbert-safety-classifier,
  title={mmBERT Safety Classifier - Level 2},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/llm-semantic-router/mmbert-safety-classifier-level2}
}
