mmBERT Safety Classifier - Level 1 (Binary)

A binary safety classifier for detecting unsafe content in LLM inputs. Part of a hierarchical MLCommons-aligned safety classification system.

Model Description

This is Level 1 of a 2-level hierarchical safety classifier:

  • Level 1 (this model): binary classification (safe/unsafe), tuned for high recall so that unsafe content is rarely missed
  • Level 2: 9-class hazard taxonomy (aligned with the MLCommons AI Safety taxonomy), used to categorize content flagged as unsafe

Performance

| Metric   | Score |
|----------|-------|
| Accuracy | 84.9% |
| F1 Score | 84.9% |

Labels

| ID | Label  | Description                    |
|----|--------|--------------------------------|
| 0  | safe   | Content is safe                |
| 1  | unsafe | Content is potentially harmful |

Training Hyperparameters

| Parameter          | Value                |
|--------------------|----------------------|
| Base Model         | jhu-clsp/mmBERT-base |
| LoRA Rank          | 32                   |
| LoRA Alpha         | 64                   |
| LoRA Dropout       | 0.1                  |
| Learning Rate      | 5e-5                 |
| Epochs             | 10                   |
| Batch Size         | 64                   |
| Max Samples        | 18,000               |
| Training Samples   | 12,600               |
| Validation Samples | 2,700                |

Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load the tokenizer, the base model, and the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-safety-classifier-level1")
base_model = AutoModelForSequenceClassification.from_pretrained(
    "jhu-clsp/mmBERT-base",
    num_labels=2,
    torch_dtype=torch.float32,
)
model = PeftModel.from_pretrained(base_model, "llm-semantic-router/mmbert-safety-classifier-level1")
model.eval()

# Inference
text = "How do I bake a chocolate cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    pred = outputs.logits.argmax(-1).item()

labels = {0: "safe", 1: "unsafe"}
print(f"Prediction: {labels[pred]}")
```
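Because Level 1 is tuned for high recall, you may prefer to flag content using a probability threshold below 0.5 rather than a plain argmax. A minimal sketch of that thresholding step, shown on dummy logits (the 0.3 cutoff is an illustrative value, not one recommended by this card):

```python
import torch

def is_unsafe(logits: torch.Tensor, threshold: float = 0.3) -> bool:
    """Flag input as unsafe when P(unsafe) exceeds `threshold`.

    A below-0.5 threshold trades precision for recall. The 0.3
    default is illustrative, not a tuned value from this card.
    """
    probs = torch.softmax(logits, dim=-1)
    return probs[..., 1].item() > threshold

# Dummy logits standing in for `outputs.logits[0]` from the snippet above.
print(is_unsafe(torch.tensor([0.2, 1.5])))   # high unsafe logit -> True
print(is_unsafe(torch.tensor([3.0, -1.0])))  # clearly safe -> False
```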

Hierarchical Classification Pipeline

For complete safety classification, use with Level 2:

```python
# If Level 1 predicts "unsafe", run Level 2 for the hazard category
if pred == 1:  # unsafe
    # Load and run the Level 2 model for the specific hazard category
    # See: llm-semantic-router/mmbert-safety-classifier-level2
    pass
```
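The two-level routing above amounts to a simple dispatch: run Level 2 only on inputs that Level 1 flags. A structural sketch, where `run_level1` and `run_level2` are hypothetical callables wrapping the two models' inference (not APIs provided by this repository):

```python
from typing import Callable

def classify(text: str,
             run_level1: Callable[[str], int],
             run_level2: Callable[[str], str]) -> str:
    """Hierarchical dispatch: Level 2 runs only when Level 1 says unsafe.

    run_level1 returns 0 (safe) or 1 (unsafe); run_level2 returns a
    hazard-category name. Both callables are hypothetical stand-ins
    for the model inference shown earlier in this card.
    """
    if run_level1(text) == 0:
        return "safe"
    return run_level2(text)

# Toy stand-ins to show the control flow.
print(classify("hello", lambda t: 0, lambda t: "violent_crimes"))  # safe
print(classify("bad",   lambda t: 1, lambda t: "violent_crimes"))  # violent_crimes
```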

Training Data

  • Primary: nvidia/Aegis-AI-Content-Safety-Dataset-2.0
  • Enhanced: 110 edge case examples for underrepresented categories (CSE, medical advice, misinformation)
  • Balance: 50/50 safe/unsafe split with oversampling
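The 50/50 balance via oversampling can be sketched as duplicating random examples from the smaller class until the class counts match. This is an assumption about the mechanism; the actual training pipeline may have used a different oversampling procedure.

```python
import random

def oversample_to_balance(by_label: dict, seed: int = 0) -> dict:
    """Duplicate random minority-class examples until every label has
    as many examples as the largest class. Illustrative only; the card
    does not describe the exact oversampling procedure used."""
    rng = random.Random(seed)
    target = max(len(v) for v in by_label.values())
    return {
        label: list(examples)
        + [rng.choice(examples) for _ in range(target - len(examples))]
        for label, examples in by_label.items()
    }

data = {"safe": ["s1", "s2", "s3", "s4"], "unsafe": ["u1", "u2"]}
balanced = oversample_to_balance(data)
print({k: len(v) for k, v in balanced.items()})  # {'safe': 4, 'unsafe': 4}
```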

Intended Use

  • Content moderation for LLM applications
  • Input filtering for AI safety systems
  • Guardrail implementation for chatbots and AI assistants

Limitations

  • May miss subtle misinformation content
  • Trained primarily on English content
  • Should be used as part of a defense-in-depth strategy

Citation

```bibtex
@misc{mmbert-safety-classifier,
  title={mmBERT Safety Classifier},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/llm-semantic-router/mmbert-safety-classifier-level1}
}
```