MLCommons AI Safety Classifier - Level 1 (Binary)

A LoRA fine-tuned mmBERT (Multilingual ModernBERT) model for binary content safety classification (safe/unsafe), following the MLCommons AI Safety Hazard Taxonomy.

Model Description

This is Level 1 of a hierarchical safety classification system:

  • Level 1 (this model): Binary classification (safe vs unsafe)
  • Level 2: 9-class hazard category classification

The model uses mmBERT (Multilingual ModernBERT) as the base, supporting 1800+ languages.

Training Results

Metric               Value
Recall               86.1%
F1 Score             86.5%
False Positive Rate  13.1%
Accuracy             86.6%
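The table does not report precision, but because F1 is the harmonic mean of precision and recall, an approximate value can be inferred from the two reported figures. The sketch below shows that arithmetic; the helper name is illustrative, and the result inherits the rounding of the published numbers.

```python
def precision_from_f1_and_recall(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for the precision P."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 and recall values are inconsistent")
    return f1 * recall / denom


# Figures taken from the table above.
recall = 0.861
f1 = 0.865

precision = precision_from_f1_and_recall(f1, recall)
print(f"Implied precision: {precision:.1%}")
```

With the reported values this comes out at roughly 86.9%, which is in line with the reported accuracy.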

Training Data

Model Architecture & Training

Base Model

jhu-clsp/mmBERT-base (mmBERT, Multilingual ModernBERT)

LoRA Configuration

Parameter             Value
Rank (r)              32
Alpha                 64
Dropout               0.1
Target Modules        attn.Wqkv, attn.Wo, mlp.Wi, mlp.Wo
Trainable Parameters  6.76M (2.15%)
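For readers who want to reproduce a comparable adapter, the table above maps fairly directly onto a PEFT `LoraConfig`. The following is a sketch rather than the authors' training script: the `SEQ_CLS` task type is an assumption based on the classification head used in the Usage section, and `build_lora_config` is a hypothetical helper name.

```python
# Values copied from the LoRA Configuration table.
LORA_KWARGS = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.1,
    "target_modules": ["attn.Wqkv", "attn.Wo", "mlp.Wi", "mlp.Wo"],
}


def build_lora_config():
    """Build a PEFT LoraConfig (peft is imported lazily, only when called)."""
    from peft import LoraConfig, TaskType

    # Assumption: sequence classification, matching the Usage section.
    return LoraConfig(task_type=TaskType.SEQ_CLS, **LORA_KWARGS)


# Standard LoRA scales the low-rank update by alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]

# 6.76M trainable parameters at 2.15% implies the approximate total count.
implied_total_millions = 6.76 / 0.0215

print(f"scaling = {scaling}, implied total = {implied_total_millions:.0f}M parameters")
```

The implied total of about 314M parameters is derived only from the two rounded figures in the table, so it should be read as an estimate.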

Training Hyperparameters

Parameter      Value
Epochs         10
Batch Size     64
Learning Rate  3e-4
Optimizer      AdamW
Scheduler      Linear warmup
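The card lists only "Linear warmup" for the scheduler, without the warmup length or the behaviour after warmup. As an illustration, the sketch below assumes the common pattern of a linear ramp to the peak learning rate followed by a linear decay to zero. The step counts are hypothetical placeholders, not values from the training run.

```python
def linear_warmup_lr(step: int, total_steps: int, warmup_steps: int,
                     peak_lr: float = 3e-4) -> float:
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 1000   # hypothetical: depends on dataset size, batch size 64, 10 epochs
WARMUP_STEPS = 100   # hypothetical: the card does not state the warmup length

for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, TOTAL_STEPS, WARMUP_STEPS))
```

If the actual run held the learning rate constant after warmup, only the second branch of the function would differ.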

Hardware & Environment

Component      Specification
GPU            AMD Instinct MI300X
VRAM           192GB HBM3
Platform       ROCm 6.2
Container      rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0
Training Time  ~4 minutes

Usage

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load the tokenizer from the adapter repo and the base model with a 2-class head
adapter_id = "llm-semantic-router/mlcommons-safety-classifier-level1-binary"
base_model = "jhu-clsp/mmBERT-base"
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

# Attach the LoRA adapter and switch to inference mode (disables dropout)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

# Classify a single text
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # no gradients are needed for inference
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()
label = "safe" if prediction == 0 else "unsafe"  # 0 = safe, 1 = unsafe
print(f"Classification: {label}")

Label Mapping

{
  "safe": 0,
  "unsafe": 1
}
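Taking the argmax of the two logits is equivalent to thresholding the "unsafe" probability at 0.5. A deployment that wants to trade recall against the false positive rate could instead apply its own threshold to that probability. The sketch below shows one way to do this using the label mapping above; the function names and the example threshold are illustrative assumptions, not part of the released model.

```python
import math

LABEL2ID = {"safe": 0, "unsafe": 1}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def unsafe_probability(logits):
    """Softmax over the two logits; returns the probability of "unsafe"."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[LABEL2ID["unsafe"]] / sum(exps)


def decide(logits, threshold=0.5):
    """Return (label, unsafe probability) for a pair of logits."""
    p = unsafe_probability(logits)
    label_id = LABEL2ID["unsafe"] if p >= threshold else LABEL2ID["safe"]
    return ID2LABEL[label_id], p


# Logits as they would come from outputs.logits[0].tolist()
print(decide([2.0, -1.0]))                # confidently safe
print(decide([0.0, 0.0], threshold=0.3))  # a lower threshold flags more inputs
```

Any non-default threshold would need to be tuned on held-out data for the target domain.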

Intended Use

This model is designed for:

  • Content moderation pipelines
  • LLM input/output safety filtering
  • Jailbreak and prompt injection detection
  • First-stage filtering before detailed hazard classification
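The first-stage filtering use case can be sketched as a small two-stage function: the binary model runs on every input, and the Level 2 hazard classifier is called only when the input is flagged. Everything below is a hypothetical illustration. The two classifiers are passed in as plain callables, and the stubs stand in for real model calls.

```python
from typing import Callable, NamedTuple, Optional


class SafetyResult(NamedTuple):
    label: str                      # "safe" or "unsafe"
    hazard_category: Optional[str]  # set only when label == "unsafe"


def two_stage_classify(
    text: str,
    level1: Callable[[str], str],
    level2: Callable[[str], str],
) -> SafetyResult:
    """Run the binary filter first; call the hazard classifier only if flagged."""
    if level1(text) == "safe":
        return SafetyResult("safe", None)
    return SafetyResult("unsafe", level2(text))


# Stub classifiers for demonstration only; real ones would wrap the models.
def stub_level1(text: str) -> str:
    return "unsafe" if "attack" in text.lower() else "safe"


def stub_level2(text: str) -> str:
    return "hazard_category_placeholder"


print(two_stage_classify("How do I make a cake?", stub_level1, stub_level2))
print(two_stage_classify("Plan an attack", stub_level1, stub_level2))
```

Skipping Level 2 for inputs that pass the binary check keeps the common path to a single model call.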

Limitations

  • Optimized for English; other languages are supported through the mmBERT base model (1800+ languages)
  • Should be used as part of a broader safety system
  • May require domain-specific fine-tuning for specialized applications

Citation

@misc{mlcommons-safety-classifier,
  title={MLCommons AI Safety Classifier},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/llm-semantic-router/mlcommons-safety-classifier-level1-binary}
}

License

Apache 2.0
