# mmBERT Safety Classifier - Level 2 (9-Class Hazard Taxonomy)

A 9-class hazard classifier aligned with the MLCommons AI Safety taxonomy, forming Level 2 of a hierarchical safety classification system.
## Model Description
This is Level 2 of a 2-level hierarchical safety classifier:
- Level 1: Binary classification (safe/unsafe) - for initial filtering
- Level 2 (this model): 9-class hazard taxonomy - for categorizing the type of unsafe content
## Performance
| Metric | Score |
|---|---|
| Accuracy | 91.5% |
| F1 Macro | 91.3% |
| F1 Weighted | 91.3% |
## MLCommons-Aligned Labels
| ID | Label | MLCommons Category | Description |
|---|---|---|---|
| 0 | S1_violent_crimes | S1 | Murder, assault, terrorism |
| 1 | S2_nonviolent_crimes | S2 | Theft, fraud, trafficking |
| 2 | S3_sex_crimes | S3, S4, S12 | Sexual exploitation, CSE, adult content |
| 3 | S5_weapons_cbrne | S5 | Chemical, biological, nuclear, explosives |
| 4 | S6_self_harm | S6 | Suicide, self-injury, eating disorders |
| 5 | S7_hate | S7, S11 | Hate speech, harassment, discrimination |
| 6 | S8_specialized_advice | S8 | Medical, legal, financial advice |
| 7 | S9_privacy | S9 | PII, doxing, surveillance |
| 8 | S13_misinformation | S13 | Elections, conspiracy, false info |
## Training Hyperparameters
| Parameter | Value |
|---|---|
| Base Model | jhu-clsp/mmBERT-base |
| LoRA Rank | 32 |
| LoRA Alpha | 64 |
| LoRA Dropout | 0.1 |
| Learning Rate | 5e-5 |
| Epochs | 10 |
| Batch Size | 64 |
| Max Samples | 18,000 |
| Training Samples | 12,614 |
| Validation Samples | 2,700 |
| Classes | 9 |
| Samples per Class | ~2,000 (balanced via oversampling) |
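The adapter settings above can be sketched with `peft`. This is a minimal sketch, not the actual training script: `target_modules` in particular is an assumption, since the card does not state which projections were adapted.

```python
from peft import LoraConfig, TaskType

# Hyperparameters taken from the table above.
# target_modules is an assumption; the real training script may differ.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=32,                               # LoRA rank
    lora_alpha=64,                      # scaling factor (alpha / r = 2.0)
    lora_dropout=0.1,
    target_modules=["query", "value"],  # assumed attention projections
)
```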
## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load the tokenizer, base model, and LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-safety-classifier-level2")
base_model = AutoModelForSequenceClassification.from_pretrained(
    "jhu-clsp/mmBERT-base",
    num_labels=9,
    torch_dtype=torch.float32,
)
model = PeftModel.from_pretrained(base_model, "llm-semantic-router/mmbert-safety-classifier-level2")
model.eval()

# Inference (use only after Level 1 predicts "unsafe")
text = "How to hack into someone's bank account"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

pred = outputs.logits.argmax(-1).item()
conf = torch.softmax(outputs.logits, dim=-1)[0][pred].item()

labels = {
    0: "S1_violent_crimes",
    1: "S2_nonviolent_crimes",
    2: "S3_sex_crimes",
    3: "S5_weapons_cbrne",
    4: "S6_self_harm",
    5: "S7_hate",
    6: "S8_specialized_advice",
    7: "S9_privacy",
    8: "S13_misinformation",
}
print(f"Hazard Category: {labels[pred]} ({conf:.2%})")
```
## Hierarchical Classification Pipeline

Use together with Level 1 for complete safety classification:

```python
# Step 1: Run Level 1 to check whether the content is unsafe
level1_pred = run_level1(text)  # returns 0 = safe, 1 = unsafe

# Step 2: If unsafe, run Level 2 for the hazard category
if level1_pred == 1:
    hazard_category = run_level2(text)
    print(f"Content is unsafe: {hazard_category}")
else:
    print("Content is safe")
```
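The two-step routing above can be wrapped in a small helper. This is a minimal sketch: `level1` and `level2` are stand-in callables (stubbed below for illustration), not the actual model APIs.

```python
from typing import Callable

def classify(text: str,
             level1: Callable[[str], int],
             level2: Callable[[str], str]) -> str:
    """Run Level 1 first; invoke Level 2 only on unsafe content."""
    if level1(text) == 1:       # 1 = unsafe
        return level2(text)     # hazard category, e.g. "S2_nonviolent_crimes"
    return "safe"

# Stub classifiers for illustration only
demo_l1 = lambda t: 1 if "hack" in t else 0
demo_l2 = lambda t: "S2_nonviolent_crimes"

print(classify("How to hack into someone's bank account", demo_l1, demo_l2))
print(classify("What's the weather today?", demo_l1, demo_l2))
```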
## Training Data

- Primary: nvidia/Aegis-AI-Content-Safety-Dataset-2.0 (18,164 samples)
- Enhanced edge cases:
  - S3_sex_crimes: 20 CSE examples
  - S8_specialized_advice: 25 medical/legal/financial advice examples
  - S13_misinformation: 25 vaccine/election/conspiracy examples
  - S2_nonviolent_crimes: 20 hacking/fraud examples
  - S5_weapons_cbrne: 20 CBRNE examples
- Balance: oversampling to ~2,000 samples per class
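The balancing step can be sketched in plain Python: minority classes are randomly oversampled with replacement until each class reaches the target size (majority classes are capped here for illustration; the actual pipeline may handle them differently). The data below is a toy stand-in.

```python
import random

def oversample(samples: dict, target_per_class: int, seed: int = 42) -> dict:
    """Balance a {label: [texts]} mapping to target_per_class per label."""
    rng = random.Random(seed)
    balanced = {}
    for label, texts in samples.items():
        if len(texts) >= target_per_class:
            # Cap majority classes (an assumption, for illustration)
            balanced[label] = texts[:target_per_class]
        else:
            # Duplicate minority samples with replacement
            extra = rng.choices(texts, k=target_per_class - len(texts))
            balanced[label] = texts + extra
    return balanced

# Toy example mirroring the raw class imbalance
data = {"S6_self_harm": ["a"] * 456, "S1_violent_crimes": ["b"] * 7184}
balanced = oversample(data, 2000)
print({k: len(v) for k, v in balanced.items()})
```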
## Class Distribution (Raw Data)
| Category | Count |
|---|---|
| S1_violent_crimes | 7,184 |
| S7_hate | 3,503 |
| S3_sex_crimes | 2,537 |
| S9_privacy | 1,395 |
| S2_nonviolent_crimes | 1,148 |
| S5_weapons_cbrne | 707 |
| S13_misinformation | 707 |
| S8_specialized_advice | 527 |
| S6_self_harm | 456 |
## Intended Use
- Detailed hazard categorization for content moderation
- Safety analytics and reporting
- Policy-specific content handling
- MLCommons-compliant safety systems
## Limitations
- Should only be used after Level 1 classifies content as "unsafe"
- Some category overlap (e.g., weapons + violent crimes)
- Misinformation detection is challenging for subtle cases
- Trained primarily on English content
## Citation

```bibtex
@misc{mmbert-safety-classifier,
  title={mmBERT Safety Classifier - Level 2},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/llm-semantic-router/mmbert-safety-classifier-level2}
}
```