# mmBERT Safety Classifier - Level 1 (Binary)
A binary safety classifier for detecting unsafe content in LLM inputs. Part of a hierarchical MLCommons-aligned safety classification system.
## Model Description
This is Level 1 of a 2-level hierarchical safety classifier:
- Level 1 (this model): Binary classification (safe/unsafe) - high recall for catching threats
- Level 2: 9-class hazard taxonomy (MLCommons AI Safety aligned) - for categorizing unsafe content
## Performance
| Metric | Score |
|---|---|
| Accuracy | 84.9% |
| F1 Score | 84.9% |
## Labels
| ID | Label | Description |
|---|---|---|
| 0 | safe | Content is safe |
| 1 | unsafe | Content is potentially harmful |
## Training Hyperparameters
| Parameter | Value |
|---|---|
| Base Model | jhu-clsp/mmBERT-base |
| LoRA Rank | 32 |
| LoRA Alpha | 64 |
| LoRA Dropout | 0.1 |
| Learning Rate | 5e-5 |
| Epochs | 10 |
| Batch Size | 64 |
| Max Samples | 18,000 |
| Training Samples | 12,600 |
| Validation Samples | 2,700 |
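The LoRA settings in the table above would correspond to a `peft` configuration along these lines. This is a sketch, not the actual training script; in particular, `target_modules` is not stated in this card and is an assumption:

```python
from peft import LoraConfig, TaskType

# Sketch of the LoRA configuration implied by the hyperparameter table.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification head
    r=32,              # LoRA Rank
    lora_alpha=64,     # LoRA Alpha
    lora_dropout=0.1,  # LoRA Dropout
    target_modules=["query", "value"],  # assumed; not stated in this card
)
```

The adapter produced by training with such a config is what `PeftModel.from_pretrained` loads in the usage example below.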
## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load the tokenizer, base model, and LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-safety-classifier-level1")
base_model = AutoModelForSequenceClassification.from_pretrained(
    "jhu-clsp/mmBERT-base",
    num_labels=2,
    torch_dtype=torch.float32,
)
model = PeftModel.from_pretrained(base_model, "llm-semantic-router/mmbert-safety-classifier-level1")
model.eval()

# Inference
text = "How do I bake a chocolate cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
pred = outputs.logits.argmax(-1).item()

labels = {0: "safe", 1: "unsafe"}
print(f"Prediction: {labels[pred]}")
```
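Because Level 1 is tuned for high recall, you may prefer thresholding the unsafe probability rather than taking a plain argmax. A minimal sketch of the idea (the example logits and the threshold value are illustrative assumptions, not values from this card; tune the threshold on a validation set):

```python
import math

def unsafe_probability(logits):
    """Softmax over a [safe, unsafe] logit pair; returns P(unsafe)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)

# Hypothetical logits, as returned by model(**inputs).logits[0].tolist()
logits = [2.1, -0.4]
p_unsafe = unsafe_probability(logits)

# Flag content as unsafe whenever P(unsafe) exceeds a recall-oriented
# threshold, even if "safe" has the larger logit.
THRESHOLD = 0.3  # assumed value; tune for your recall/precision trade-off
label = "unsafe" if p_unsafe > THRESHOLD else "safe"
```

Lowering the threshold trades precision for recall, which matches this model's role as a first-pass gate.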
## Hierarchical Classification Pipeline

For complete safety classification, use this model together with Level 2:

```python
# If Level 1 predicts "unsafe", run Level 2 for the hazard category
if pred == 1:  # unsafe
    # Load and run the Level 2 model for the specific hazard category.
    # See: llm-semantic-router/mmbert-safety-classifier-level2
    pass
```
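Putting the two levels together, the routing logic can be sketched as below. The `classify_level1` / `classify_level2` helpers are hypothetical stand-ins (with placeholder logic) for the real inference code shown in the Usage section and for the Level 2 model:

```python
def classify_level1(text):
    """Hypothetical stub for the Level 1 binary model (see Usage above)."""
    return "unsafe" if "attack" in text.lower() else "safe"  # placeholder logic

def classify_level2(text):
    """Hypothetical stub for the Level 2 9-class hazard model."""
    return "violent_crimes"  # placeholder hazard category

def classify(text):
    # Level 1 runs first: a cheap binary gate tuned for high recall.
    if classify_level1(text) == "safe":
        return {"label": "safe", "hazard": None}
    # Only inputs flagged unsafe pay for the Level 2 taxonomy pass.
    return {"label": "unsafe", "hazard": classify_level2(text)}
```

The design keeps the common case (safe input) on the cheap path and reserves the 9-class model for the minority of flagged inputs.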
## Training Data
- Primary: nvidia/Aegis-AI-Content-Safety-Dataset-2.0
- Enhanced: 110 edge case examples for underrepresented categories (CSE, medical advice, misinformation)
- Balance: 50/50 safe/unsafe split with oversampling
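The 50/50 balance described above can be reached by oversampling the minority class. A minimal sketch, assuming the data is held as `(text, label)` pairs (the actual preprocessing pipeline may differ):

```python
import random

def balance_by_oversampling(samples, seed=0):
    """Duplicate random minority-class examples until both classes are equal.

    `samples` is a list of (text, label) pairs with labels 0 (safe) / 1 (unsafe).
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for sample in samples:
        by_label[sample[1]].append(sample)
    # Identify the smaller class and how many extra copies it needs.
    minority = min(by_label, key=lambda k: len(by_label[k]))
    deficit = abs(len(by_label[0]) - len(by_label[1]))
    extras = [rng.choice(by_label[minority]) for _ in range(deficit)]
    balanced = samples + extras
    rng.shuffle(balanced)
    return balanced
```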
## Intended Use
- Content moderation for LLM applications
- Input filtering for AI safety systems
- Guardrail implementation for chatbots and AI assistants
## Limitations
- May miss subtle misinformation content
- Trained primarily on English content
- Should be used as part of a defense-in-depth strategy
## Citation

```bibtex
@misc{mmbert-safety-classifier,
  title={mmBERT Safety Classifier},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/llm-semantic-router/mmbert-safety-classifier-level1}
}
```