MLCommons AI Safety Classifier - Level 1 (Binary)
A LoRA fine-tuned multilingual BERT model for binary content safety classification (safe/unsafe), following the MLCommons AI Safety Hazard Taxonomy.
Model Description
This is Level 1 of a hierarchical safety classification system:
- Level 1 (this model): Binary classification (safe vs unsafe)
- Level 2: 9-class hazard category classification
The model uses mmBERT (Multilingual ModernBERT) as its base model, which supports 1800+ languages.
Training Results
| Metric | Value |
|---|---|
| Recall | 86.1% |
| F1 Score | 86.5% |
| False Positive Rate | 13.1% |
| Accuracy | 86.6% |
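For reference, all four metrics in the table derive from one confusion matrix, with "unsafe" treated as the positive class. The sketch below uses the standard definitions. The counts in the example are hypothetical and are not the model's actual evaluation counts, which this card does not list.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Binary confusion-matrix counts, with "unsafe" as the positive class."""
    tp: int  # unsafe text predicted unsafe
    fp: int  # safe text predicted unsafe
    tn: int  # safe text predicted safe
    fn: int  # unsafe text predicted safe


def binary_metrics(c: ConfusionCounts) -> dict:
    """Compute recall, F1, false positive rate, and accuracy."""
    recall = c.tp / (c.tp + c.fn)
    precision = c.tp / (c.tp + c.fp)
    f1 = 2 * precision * recall / (precision + recall)
    false_positive_rate = c.fp / (c.fp + c.tn)
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    return {
        "recall": recall,
        "f1": f1,
        "false_positive_rate": false_positive_rate,
        "accuracy": accuracy,
    }


# Hypothetical counts, for illustration only.
example = ConfusionCounts(tp=90, fp=10, tn=90, fn=10)
metrics = binary_metrics(example)
print(metrics)
```

A false positive here is safe content flagged as unsafe, so the false positive rate is the number to watch when over-blocking is a concern.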
Training Data
- Total samples: 20,000 (balanced)
- Safe: 10,000
- Unsafe: 10,000
- Sources:
Model Architecture & Training
Base Model
- Model: jhu-clsp/mmBERT-base
- Architecture: ModernBERT (314M parameters)
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target Modules | attn.Wqkv, attn.Wo, mlp.Wi, mlp.Wo |
| Trainable Parameters | 6.76M (2.15%) |
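As a rough sanity check on the trainable parameter count, the arithmetic below tallies the LoRA adapter weights for the four target modules. The layer shapes (hidden size 768, MLP intermediate size 1152 with a gated `Wi`, a fused `Wqkv`, 22 layers) are assumptions about the base encoder and are not stated in this card. Only the rank and the 314M total come from the tables above.

```python
# Assumed base-encoder shapes (not stated in this card).
HIDDEN = 768
INTERMEDIATE = 1152
NUM_LAYERS = 22

RANK = 32             # from the LoRA configuration table
BASE_PARAMS = 314e6   # from the Base Model section

# (in_features, out_features) of each targeted linear layer, under the
# assumptions above: Wqkv is a fused projection to 3 * HIDDEN, and Wi is a
# gated projection to 2 * INTERMEDIATE.
TARGET_SHAPES = {
    "attn.Wqkv": (HIDDEN, 3 * HIDDEN),
    "attn.Wo": (HIDDEN, HIDDEN),
    "mlp.Wi": (HIDDEN, 2 * INTERMEDIATE),
    "mlp.Wo": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds two matrices per layer: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} adapter parameters ({total / BASE_PARAMS:.2%} of 314M)")
# prints: 6,758,400 adapter parameters (2.15% of 314M)
```

Under these assumptions the adapter count lands at about 6.76M, in line with the table. The tally covers adapter weights only and leaves out any classification head.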
Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch Size | 64 |
| Learning Rate | 3e-4 |
| Optimizer | AdamW |
| Scheduler | Linear warmup |
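To make the schedule concrete, here is a minimal sketch of a learning rate that warms up linearly to the peak and then decays linearly to zero. Several inputs are assumptions: the card does not state a warmup length (`WARMUP_RATIO` is hypothetical), whether the rate decays after warmup, whether all 20,000 samples were used for training, or whether gradient accumulation was used.

```python
import math

NUM_SAMPLES = 20_000  # from Training Data; assumed to be all training samples
BATCH_SIZE = 64       # from the hyperparameter table
EPOCHS = 10           # from the hyperparameter table
PEAK_LR = 3e-4        # from the hyperparameter table
WARMUP_RATIO = 0.1    # hypothetical; not stated in this card

STEPS_PER_EPOCH = math.ceil(NUM_SAMPLES / BATCH_SIZE)
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS
WARMUP_STEPS = round(WARMUP_RATIO * TOTAL_STEPS)


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


print(STEPS_PER_EPOCH, TOTAL_STEPS, WARMUP_STEPS)
print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(TOTAL_STEPS))
```

With these numbers an epoch is 313 optimizer steps, so the whole run is short, which fits the reported training time of a few minutes on a single GPU.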
Hardware & Environment
| Component | Specification |
|---|---|
| GPU | AMD Instinct MI300X |
| VRAM | 192GB HBM3 |
| Platform | ROCm 6.2 |
| Container | rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0 |
| Training Time | ~4 minutes |
Usage
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
# Hub IDs for the base model and the LoRA adapter
base_model = "jhu-clsp/mmBERT-base"
adapter = "llm-semantic-router/mlcommons-safety-classifier-level1-binary"
# Load the tokenizer, the base model with a 2-class head, then attach the adapter
tokenizer = AutoTokenizer.from_pretrained(adapter)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)
model = PeftModel.from_pretrained(model, adapter)
model.eval()  # inference mode: disables dropout
# Classify
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()
label = "safe" if prediction == 0 else "unsafe"  # 0 = safe, 1 = unsafe
print(f"Classification: {label}")
Label Mapping
{
"safe": 0,
"unsafe": 1
}
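The mapping above can also be used to turn raw logits into a label with an adjustable decision threshold, which is one way to trade recall against false positives. The sketch below is plain Python and assumes the logits arrive as a two-element list ordered by class ID. The `unsafe_threshold` parameter and the `decode` helper are hypothetical and not part of the model's API.

```python
import math

LABEL2ID = {"safe": 0, "unsafe": 1}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, unsafe_threshold=0.5):
    """Return (label, unsafe probability) for one pair of logits.

    With the default threshold of 0.5 this matches taking the argmax,
    including the tie case, which resolves to "safe".
    """
    p_unsafe = softmax(logits)[LABEL2ID["unsafe"]]
    label = "unsafe" if p_unsafe > unsafe_threshold else "safe"
    return label, p_unsafe


# Hypothetical logits, for illustration only.
print(decode([1.0, 0.0]))                        # default threshold
print(decode([1.0, 0.0], unsafe_threshold=0.2))  # stricter filter
```

Lowering the threshold flags more content as unsafe, which raises recall at the cost of more false positives. A suitable value depends on the application and should be chosen on held-out data.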
Intended Use
This model is designed for:
- Content moderation pipelines
- LLM input/output safety filtering
- Jailbreak and prompt injection detection
- First-stage filtering before detailed hazard classification
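The first-stage filtering use case can be sketched as a small router: only text that Level 1 flags as unsafe is passed to the Level 2 hazard classifier. The function names, the stub classifiers, and the category string below are hypothetical placeholders. In practice the two callables would wrap the actual Level 1 and Level 2 models.

```python
from typing import Callable, Optional


def route(
    text: str,
    is_unsafe: Callable[[str], bool],
    hazard_category: Callable[[str], str],
) -> Optional[str]:
    """Run Level 1 first; call Level 2 only for text flagged unsafe.

    Returns None for safe text, otherwise the Level 2 category.
    """
    if not is_unsafe(text):
        return None
    return hazard_category(text)


# Hypothetical stand-ins so the sketch runs without loading any model.
def stub_level1(text: str) -> bool:
    return "attack" in text.lower()


def stub_level2(text: str) -> str:
    return "hypothetical-category"


print(route("How do I make a cake?", stub_level1, stub_level2))
print(route("Plan an attack", stub_level1, stub_level2))
```

Because the second stage only runs on flagged inputs, the binary model's false positive rate determines how much safe traffic still reaches the more detailed classifier.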
Limitations
- Optimized for English; other languages are supported through the mmBERT base model (1800+ languages)
- Should be used as part of a broader safety system
- May require domain-specific fine-tuning for specialized applications
Citation
@misc{mlcommons-safety-classifier,
title={MLCommons AI Safety Classifier},
author={LLM Semantic Router Team},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/llm-semantic-router/mlcommons-safety-classifier-level1-binary}
}
License
Apache 2.0