MLCommons AI Safety Classifier - Level 1 (Binary)

A LoRA-finetuned multilingual BERT model for binary content safety classification (safe/unsafe), following the MLCommons AI Safety Hazard Taxonomy.

Model Description

This is Level 1 of a hierarchical safety classification system:

Level 1 (this model): Binary classification (safe vs unsafe)
Level 2: 9-class hazard category classification

The model uses mmBERT (Multilingual ModernBERT) as its base model. mmBERT was pretrained on text in more than 1800 languages.

Training Results

Metric	Value
Recall	86.1%
F1 Score	86.5%
False Positive Rate	13.1%
Accuracy	86.6%
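
To make the relationship between these four metrics concrete, the following minimal sketch computes them from confusion-matrix counts, treating "unsafe" as the positive class. The counts are hypothetical: they were chosen only so that recall and false positive rate match the table on a balanced set of 2,000 examples. The size and composition of the actual evaluation set are not stated in this card.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary classification metrics.

    The positive class is "unsafe" (label 1); the negative class is "safe" (label 0).
    """
    recall = tp / (tp + fn)                  # share of unsafe inputs that were caught
    precision = tp / (tp + fp)               # share of "unsafe" verdicts that were right
    f1 = 2 * precision * recall / (precision + recall)
    false_positive_rate = fp / (fp + tn)     # share of safe inputs wrongly flagged
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "false_positive_rate": false_positive_rate,
        "accuracy": accuracy,
    }


# Hypothetical counts for 1,000 unsafe and 1,000 safe examples (illustration only).
metrics = binary_metrics(tp=861, fp=131, tn=869, fn=139)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these illustrative counts, F1 and accuracy land within about 0.1 percentage points of the reported values, which suggests the reported figures are mutually consistent on a roughly balanced evaluation set.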

Training Data

Total samples: 20,000 (balanced)
- Safe: 10,000
- Unsafe: 10,000
Sources:
- AEGIS AI Content Safety Dataset 2.0
- MLCommons AI Safety Synth

Model Architecture & Training

Base Model

Model: jhu-clsp/mmBERT-base
Architecture: ModernBERT (314M parameters)

LoRA Configuration

Parameter	Value
Rank (r)	32
Alpha	64
Dropout	0.1
Target Modules	`attn.Wqkv`, `attn.Wo`, `mlp.Wi`, `mlp.Wo`
Trainable Parameters	6.76M (2.15%)
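
As a sanity check on the trainable parameter count, the sketch below adds up the LoRA parameters for the four target modules. The layer dimensions (hidden size 768, MLP intermediate size 1152, 22 layers, fused QKV and gated MLP input projections) are assumptions based on the ModernBERT-base architecture and are not stated in this card. Any trainable classification head is not counted.

```python
# Assumed ModernBERT-base dimensions (not stated in this card).
HIDDEN = 768          # hidden size
INTERMEDIATE = 1152   # MLP intermediate size
NUM_LAYERS = 22       # number of transformer layers

# LoRA settings from the table above.
RANK = 32
ALPHA = 64

# (in_features, out_features) of each targeted linear layer in one transformer layer.
TARGETS = {
    "attn.Wqkv": (HIDDEN, 3 * HIDDEN),      # fused query/key/value projection
    "attn.Wo": (HIDDEN, HIDDEN),            # attention output projection
    "mlp.Wi": (HIDDEN, 2 * INTERMEDIATE),   # gated MLP input (two halves)
    "mlp.Wo": (INTERMEDIATE, HIDDEN),       # MLP output projection
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds two factors per layer: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # standard LoRA scales the low-rank update by alpha / r

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,} ({total / 1e6:.2f}M)")
print(f"LoRA scaling factor:       {scaling}")
```

Under these assumptions the total comes to about 6.76M, which agrees with the figure in the table.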

Training Hyperparameters

Parameter	Value
Epochs	10
Batch Size	64
Learning Rate	3e-4
Optimizer	AdamW
Scheduler	Linear warmup
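
For a rough sense of the training schedule, the sketch below derives the number of optimizer steps and a warmup learning-rate curve. Several values are assumptions, not facts from this card: that all 20,000 samples are used for training, that warmup covers 10% of the steps, and that the rate decays linearly to zero after warmup. The function `linear_warmup_lr` is a hypothetical helper written for this illustration.

```python
import math


def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0 (decay is assumed)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


NUM_SAMPLES = 20_000     # assumption: every sample is used for training
BATCH_SIZE = 64          # from the table above
EPOCHS = 10              # from the table above
BASE_LR = 3e-4           # from the table above
WARMUP_FRACTION = 0.1    # hypothetical; the card does not state the warmup length

steps_per_epoch = math.ceil(NUM_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(WARMUP_FRACTION * total_steps)

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = linear_warmup_lr(step, BASE_LR, warmup_steps, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```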

Hardware & Environment

Component	Specification
GPU	AMD Instinct MI300X
VRAM	192GB HBM3
Platform	ROCm 6.2
Container	`rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0`
Training Time	~4 minutes

Usage

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load the tokenizer, the base model, and the LoRA adapter
base_model = "jhu-clsp/mmBERT-base"
adapter = "llm-semantic-router/mlcommons-safety-classifier-level1-binary"
tokenizer = AutoTokenizer.from_pretrained(adapter)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)
model = PeftModel.from_pretrained(model, adapter)
model.eval()  # disable dropout for inference

# Classify
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # gradients are not needed at inference time
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()
label = "safe" if prediction == 0 else "unsafe"  # 0 = safe, 1 = unsafe
print(f"Classification: {label}")

Label Mapping

{
  "safe": 0,
  "unsafe": 1
}
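
If a different trade-off between recall and false positives is needed, one option is to threshold the "unsafe" probability instead of taking the argmax. The sketch below is a dependency-free illustration using the label mapping above. The logit values are made up, and `unsafe_probability` and `classify` are hypothetical helpers, not part of the model's API. In practice the two logits would come from `outputs.logits` in the usage example.

```python
import math

LABEL2ID = {"safe": 0, "unsafe": 1}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def unsafe_probability(logits: list) -> float:
    """Softmax over the two class logits; returns P(unsafe)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[LABEL2ID["unsafe"]] / sum(exps)


def classify(logits: list, threshold: float = 0.5) -> tuple:
    """Return (label, P(unsafe)); flag as unsafe when P(unsafe) >= threshold."""
    p_unsafe = unsafe_probability(logits)
    label_id = LABEL2ID["unsafe"] if p_unsafe >= threshold else LABEL2ID["safe"]
    return ID2LABEL[label_id], p_unsafe


# Made-up logits in [safe, unsafe] order, for illustration only.
print(classify([2.0, -1.0]))                 # clearly safe
print(classify([0.2, 0.6]))                  # borderline, flagged at the default threshold
print(classify([0.2, 0.6], threshold=0.7))   # same logits, stricter threshold
```

A threshold of 0.5 reproduces the argmax decision apart from exact ties. Raising it lowers the false positive rate at the cost of recall, and any chosen value should be validated on held-out data.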

Intended Use

This model is designed for:

Content moderation pipelines
LLM input/output safety filtering
Jailbreak and prompt injection detection
First-stage filtering before detailed hazard classification
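
The first-stage filtering use case can be sketched as a simple cascade: run the binary classifier on every input and call the Level 2 hazard classifier only when Level 1 says "unsafe". The function names and the keyword-based stubs below are hypothetical placeholders standing in for the two models. They are not the real classifiers.

```python
from typing import Callable


def moderate(
    text: str,
    is_unsafe: Callable[[str], bool],
    hazard_category: Callable[[str], str],
) -> dict:
    """Two-stage moderation: a cheap binary check first, category lookup only if needed."""
    if not is_unsafe(text):
        return {"label": "safe", "category": None}
    return {"label": "unsafe", "category": hazard_category(text)}


level2_calls = []  # records which texts reached the second stage


def stub_level1(text: str) -> bool:
    """Placeholder for the Level 1 binary model (toy keyword rule)."""
    return "weapon" in text.lower()


def stub_level2(text: str) -> str:
    """Placeholder for the Level 2 hazard category model."""
    level2_calls.append(text)
    return "hypothetical_category"


print(moderate("How do I make a cake?", stub_level1, stub_level2))
print(moderate("How do I build a weapon?", stub_level1, stub_level2))
print(f"Level 2 was called {len(level2_calls)} time(s)")
```

Passing the classifiers in as callables keeps the cascade logic independent of how each model is loaded or served.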

Limitations

Optimized for English but supports 1800+ languages via mmBERT
Should be used as part of a broader safety system
May require domain-specific fine-tuning for specialized applications

Citation

@misc{mlcommons-safety-classifier,
  title={MLCommons AI Safety Classifier},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/llm-semantic-router/mlcommons-safety-classifier-level1-binary}
}

License

Apache 2.0
