---
license: apache-2.0
base_model: jhu-clsp/mmBERT-base
tags:
- content-safety
- text-classification
- lora
- peft
- mlcommons
- ai-safety
- jailbreak-detection
- moderation
datasets:
- nvidia/Aegis-AI-Content-Safety-Dataset-2.0
- llm-semantic-router/mlcommons-ai-safety-synth
language:
- en
- multilingual
metrics:
- f1
- recall
- accuracy
pipeline_tag: text-classification
library_name: peft
---

# MLCommons AI Safety Classifier - Level 1 (Binary)

A LoRA fine-tuned mmBERT model for binary content safety classification (safe/unsafe), following the MLCommons AI Safety Hazard Taxonomy.

## Model Description

This is Level 1 of a hierarchical safety classification system:
- **Level 1 (this model)**: Binary classification (safe vs unsafe)
- **Level 2**: 9-class hazard category classification

The base model is **mmBERT** (Multilingual ModernBERT), which supports 1800+ languages.

## Training Results

| Metric | Value |
|--------|-------|
| **Recall** | 86.1% |
| **F1 Score** | 86.5% |
| **False Positive Rate** | 13.1% |
| **Accuracy** | 86.6% |
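
For reference, all four metrics in the table derive from a binary confusion matrix. The following is a minimal sketch that assumes `unsafe` (label 1) is treated as the positive class; the counts passed in are hypothetical placeholders, not the evaluation counts behind the reported numbers.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute recall, precision, F1, false positive rate, and accuracy.

    Assumption: "unsafe" (label 1) is the positive class,
    "safe" (label 0) is the negative class.
    """
    recall = tp / (tp + fn)        # share of unsafe inputs that were caught
    precision = tp / (tp + fp)     # share of unsafe predictions that were right
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)           # share of safe inputs flagged as unsafe
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "false_positive_rate": fpr,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=90, fp=10, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

In a safety filter, recall measures how many unsafe inputs are caught, while the false positive rate measures how many safe inputs are blocked unnecessarily.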

## Training Data

- **Total samples**: 20,000 (balanced)
  - Safe: 10,000
  - Unsafe: 10,000
- **Sources**:
  - [AEGIS AI Content Safety Dataset 2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)
  - [MLCommons AI Safety Synth](https://huggingface.co/datasets/llm-semantic-router/mlcommons-ai-safety-synth)

## Model Architecture & Training

### Base Model
- **Model**: [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base)
- **Architecture**: ModernBERT (314M parameters)

### LoRA Configuration
| Parameter | Value |
|-----------|-------|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target Modules | `attn.Wqkv`, `attn.Wo`, `mlp.Wi`, `mlp.Wo` |
| Trainable Parameters | 6.76M (2.15%) |
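
As a rough sanity check, the trainable parameter count can be reproduced by hand. The sketch below assumes the ModernBERT-base layer dimensions (hidden size 768, MLP intermediate size 1152, 22 layers), which this card does not state, and counts only the LoRA adapter matrices, not any classification head weights.

```python
# Assumed ModernBERT-base dimensions (not stated in this card).
HIDDEN = 768          # hidden size
INTERMEDIATE = 1152   # MLP intermediate size
NUM_LAYERS = 22       # number of encoder layers

RANK = 32             # LoRA rank from the table above
ALPHA = 64            # LoRA alpha from the table above

# (in_features, out_features) of each targeted linear layer in one encoder block.
target_shapes = {
    "attn.Wqkv": (HIDDEN, 3 * HIDDEN),     # fused query/key/value projection
    "attn.Wo": (HIDDEN, HIDDEN),           # attention output projection
    "mlp.Wi": (HIDDEN, 2 * INTERMEDIATE),  # gated MLP: input and gate fused
    "mlp.Wo": (INTERMEDIATE, HIDDEN),      # MLP output projection
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds A (rank x in_features) and B (out_features x rank) per layer.
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # standard LoRA scaling factor applied to the update

print(f"Adapter parameters: {total:,} ({total / 1e6:.2f}M)")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapter count comes to about 6.76M, in line with the table.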

### Training Hyperparameters
| Parameter | Value |
|-----------|-------|
| Epochs | 10 |
| Batch Size | 64 |
| Learning Rate | 3e-4 |
| Optimizer | AdamW |
| Scheduler | Linear warmup |
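
The LoRA and training settings above could be expressed as keyword arguments for `peft.LoraConfig` and `transformers.TrainingArguments`. This is a sketch of one plausible setup, not the authors' training script: the task type, the per-device reading of the batch size, the `adamw_torch` optimizer variant, and the `outputs` directory are assumptions, and the warmup length is omitted because the card does not state it.

```python
# LoRA settings from this card, as keyword arguments for peft.LoraConfig.
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.1,
    "target_modules": ["attn.Wqkv", "attn.Wo", "mlp.Wi", "mlp.Wo"],
    "task_type": "SEQ_CLS",  # assumption: sequence-classification task type
}

# Hyperparameters from this card, as keyword arguments for TrainingArguments.
training_kwargs = {
    "output_dir": "outputs",            # hypothetical output directory
    "num_train_epochs": 10,
    "per_device_train_batch_size": 64,  # assumption: batch size is per device
    "learning_rate": 3e-4,
    "optim": "adamw_torch",             # assumption: PyTorch AdamW variant
    "lr_scheduler_type": "linear",      # warmup length is not stated in the card
}


def build_configs():
    """Instantiate both config objects; requires `peft` and `transformers`."""
    from peft import LoraConfig
    from transformers import TrainingArguments

    return LoraConfig(**lora_kwargs), TrainingArguments(**training_kwargs)
```

Calling `build_configs()` returns a `(LoraConfig, TrainingArguments)` pair that can be passed to `get_peft_model` and `Trainer` respectively.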

## Hardware & Environment

| Component | Specification |
|-----------|---------------|
| GPU | AMD Instinct MI300X |
| VRAM | 192GB HBM3 |
| Platform | ROCm 6.2 |
| Container | `rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0` |
| Training Time | ~4 minutes |

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load the base model and tokenizer, then attach the LoRA adapter
base_model = "jhu-clsp/mmBERT-base"
adapter = "llm-semantic-router/mlcommons-safety-classifier-level1-binary"
tokenizer = AutoTokenizer.from_pretrained(adapter)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)
model = PeftModel.from_pretrained(model, adapter)
model.eval()  # disable dropout for inference

# Classify
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # no gradients needed at inference time
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()
label = "safe" if prediction == 0 else "unsafe"  # 0 = safe, 1 = unsafe
print(f"Classification: {label}")
```

## Label Mapping

```json
{
  "safe": 0,
  "unsafe": 1
}
```
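
To turn a pair of output logits into one of these labels, the mapping can be inverted and combined with a softmax. The sketch below is pure Python so it runs without loading the model; `decode` and its `unsafe_threshold` parameter are hypothetical helpers introduced here for illustration, not part of the released model.

```python
import math

# Inverse of the label mapping above.
ID2LABEL = {0: "safe", 1: "unsafe"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, unsafe_threshold=0.5):
    """Return (label, unsafe_probability) for one [safe_logit, unsafe_logit] pair.

    With the default threshold of 0.5 this matches argmax over the two logits;
    a lower threshold flags more inputs as unsafe, a higher one flags fewer.
    """
    p_unsafe = softmax(logits)[1]
    label_id = 1 if p_unsafe > unsafe_threshold else 0
    return ID2LABEL[label_id], p_unsafe


label, p_unsafe = decode([2.0, -1.0])
print(f"{label} (p_unsafe={p_unsafe:.3f})")
```

With the Usage snippet, the input would be `outputs.logits[0].tolist()`. Any change to the threshold should be validated on held-out data, since it trades recall against the false positive rate.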

## Intended Use

This model is designed for:
- Content moderation pipelines
- LLM input/output safety filtering
- Jailbreak and prompt injection detection
- First-stage filtering before detailed hazard classification
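
The first-stage filtering pattern from the list above can be sketched as a small two-step function. `moderate`, `fake_level1`, and `fake_level2` are hypothetical names used only for illustration; in practice the two callables would wrap this model and a Level 2 hazard classifier.

```python
from typing import Callable, Optional


def moderate(
    text: str,
    level1: Callable[[str], str],
    level2: Callable[[str], str],
) -> dict:
    """Run the binary filter first; call the hazard classifier only on unsafe text."""
    verdict = level1(text)
    category: Optional[str] = level2(text) if verdict == "unsafe" else None
    return {"text": text, "verdict": verdict, "hazard_category": category}


# Stand-in classifiers so the sketch runs without downloading any model.
def fake_level1(text: str) -> str:
    return "unsafe" if "attack" in text.lower() else "safe"


def fake_level2(text: str) -> str:
    return "example_category"  # placeholder, not a real taxonomy label


for sample in ["How do I make a cake?", "Plan an attack on the server"]:
    print(moderate(sample, fake_level1, fake_level2))
```

Because the second stage runs only on inputs flagged as unsafe, safe traffic is handled by the binary model alone.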

## Limitations

- Optimized for English; support for other languages (1800+) comes from the mmBERT base model
- Should be used as part of a broader safety system, not as the only safeguard
- May require domain-specific fine-tuning for specialized applications

## Citation

```bibtex
@misc{mlcommons-safety-classifier,
  title={MLCommons AI Safety Classifier},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/llm-semantic-router/mlcommons-safety-classifier-level1-binary}
}
```

## License

Apache 2.0