# mmBERT Safety Classifier - Level 1 (Binary)
A binary safety classifier for detecting unsafe content in LLM inputs. Part of a hierarchical MLCommons-aligned safety classification system.
## Model Description
This is Level 1 of a 2-level hierarchical safety classifier:
- Level 1 (this model): Binary classification (safe/unsafe) - high recall for catching threats
- Level 2: 9-class hazard taxonomy (MLCommons AI Safety aligned) - for categorizing unsafe content
## Performance
| Metric | Score |
|---|---|
| Accuracy | 84.9% |
| F1 Score | 84.9% |
## Labels
| ID | Label | Description |
|---|---|---|
| 0 | safe | Content is safe |
| 1 | unsafe | Content is potentially harmful |
## Training Hyperparameters
| Parameter | Value |
|---|---|
| Base Model | jhu-clsp/mmBERT-base |
| LoRA Rank | 32 |
| LoRA Alpha | 64 |
| LoRA Dropout | 0.1 |
| Learning Rate | 5e-5 |
| Epochs | 10 |
| Batch Size | 64 |
| Max Samples | 18,000 |
| Training Samples | 12,600 |
| Validation Samples | 2,700 |
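The LoRA settings in the table above would correspond to a `peft` configuration along these lines. This is a sketch, not the actual training script; in particular, `target_modules` is not stated in this card and is an assumption:

```python
from peft import LoraConfig, TaskType

# Sketch of the LoRA configuration implied by the hyperparameter table.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification head
    r=32,              # LoRA Rank
    lora_alpha=64,     # LoRA Alpha
    lora_dropout=0.1,  # LoRA Dropout
    target_modules=["query", "value"],  # assumed; not stated in this card
)
```

The adapter produced by training with such a config is what `PeftModel.from_pretrained` loads in the usage example below.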
## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load the tokenizer, base model, and LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-safety-classifier-level1")
base_model = AutoModelForSequenceClassification.from_pretrained(
    "jhu-clsp/mmBERT-base",
    num_labels=2,
    torch_dtype=torch.float32,
)
model = PeftModel.from_pretrained(base_model, "llm-semantic-router/mmbert-safety-classifier-level1")
model.eval()

# Inference
text = "How do I bake a chocolate cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
pred = outputs.logits.argmax(-1).item()

labels = {0: "safe", 1: "unsafe"}
print(f"Prediction: {labels[pred]}")
```
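Because Level 1 is tuned for high recall, you may prefer thresholding the unsafe probability rather than taking a plain argmax. A minimal sketch of the idea (the example logits and the threshold value are illustrative assumptions, not values from this card; tune the threshold on a validation set):

```python
import math

def unsafe_probability(logits):
    """Softmax over a [safe, unsafe] logit pair; returns P(unsafe)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)

# Hypothetical logits, as returned by model(**inputs).logits[0].tolist()
logits = [2.1, -0.4]
p_unsafe = unsafe_probability(logits)

# Flag content as unsafe whenever P(unsafe) exceeds a recall-oriented
# threshold, even if "safe" has the larger logit.
THRESHOLD = 0.3  # assumed value; tune for your recall/precision trade-off
label = "unsafe" if p_unsafe > THRESHOLD else "safe"
```

Lowering the threshold trades precision for recall, which matches this model's role as a first-pass gate.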
## Hierarchical Classification Pipeline

For complete safety classification, use this model together with Level 2:

```python
# If Level 1 predicts "unsafe", run Level 2 for the hazard category
if pred == 1:  # unsafe
    # Load and run the Level 2 model for the specific hazard category.
    # See: llm-semantic-router/mmbert-safety-classifier-level2
    pass
```
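Putting the two levels together, the routing logic can be sketched as below. The `classify_level1` / `classify_level2` helpers are hypothetical stand-ins (with placeholder logic) for the real inference code shown in the Usage section and for the Level 2 model:

```python
def classify_level1(text):
    """Hypothetical stub for the Level 1 binary model (see Usage above)."""
    return "unsafe" if "attack" in text.lower() else "safe"  # placeholder logic

def classify_level2(text):
    """Hypothetical stub for the Level 2 9-class hazard model."""
    return "violent_crimes"  # placeholder hazard category

def classify(text):
    # Level 1 runs first: a cheap binary gate tuned for high recall.
    if classify_level1(text) == "safe":
        return {"label": "safe", "hazard": None}
    # Only inputs flagged unsafe pay for the Level 2 taxonomy pass.
    return {"label": "unsafe", "hazard": classify_level2(text)}
```

The design keeps the common case (safe input) on the cheap path and reserves the 9-class model for the minority of flagged inputs.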
## Training Data
- Primary: nvidia/Aegis-AI-Content-Safety-Dataset-2.0
- Enhanced: 110 edge case examples for underrepresented categories (CSE, medical advice, misinformation)
- Balance: 50/50 safe/unsafe split with oversampling
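The 50/50 balance described above can be reached by oversampling the minority class. A minimal sketch, assuming the data is held as `(text, label)` pairs (the actual preprocessing pipeline may differ):

```python
import random

def balance_by_oversampling(samples, seed=0):
    """Duplicate random minority-class examples until both classes are equal.

    `samples` is a list of (text, label) pairs with labels 0 (safe) / 1 (unsafe).
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for sample in samples:
        by_label[sample[1]].append(sample)
    # Identify the smaller class and how many extra copies it needs.
    minority = min(by_label, key=lambda k: len(by_label[k]))
    deficit = abs(len(by_label[0]) - len(by_label[1]))
    extras = [rng.choice(by_label[minority]) for _ in range(deficit)]
    balanced = samples + extras
    rng.shuffle(balanced)
    return balanced
```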
## Intended Use
- Content moderation for LLM applications
- Input filtering for AI safety systems
- Guardrail implementation for chatbots and AI assistants
## Limitations
- May miss subtle misinformation content
- Trained primarily on English content
- Should be used as part of a defense-in-depth strategy
## Citation

```bibtex
@misc{mmbert-safety-classifier,
  title={mmBERT Safety Classifier},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/llm-semantic-router/mmbert-safety-classifier-level1}
}
```