---
license: apache-2.0
base_model: jhu-clsp/mmBERT-base
tags:
- content-safety
- text-classification
- lora
- peft
- mlcommons
- ai-safety
- jailbreak-detection
- moderation
datasets:
- nvidia/Aegis-AI-Content-Safety-Dataset-2.0
- llm-semantic-router/mlcommons-ai-safety-synth
language:
- en
- multilingual
metrics:
- f1
- recall
- accuracy
pipeline_tag: text-classification
library_name: peft
---
# MLCommons AI Safety Classifier - Level 1 (Binary)
A LoRA-finetuned multilingual BERT model for binary content safety classification (safe/unsafe), following the MLCommons AI Safety Hazard Taxonomy.
## Model Description
This is Level 1 of a hierarchical safety classification system:
- **Level 1 (this model)**: Binary classification (safe vs unsafe)
- **Level 2**: 9-class hazard category classification
The model uses **mmBERT** (Multilingual ModernBERT) as the base, supporting 1800+ languages.
## Training Results
| Metric | Value |
|--------|-------|
| **Recall** | 86.1% |
| **F1 Score** | 86.5% |
| **False Positive Rate** | 13.1% |
| **Accuracy** | 86.6% |
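These are the standard binary-classification metrics with "unsafe" as the positive class. A minimal sketch of how they are defined, computed from a confusion matrix (the counts below are illustrative placeholders, not the actual evaluation data):

```python
# Illustrative confusion-matrix counts; NOT the real evaluation numbers.
tp, fn = 861, 139   # unsafe examples caught / missed
tn, fp = 869, 131   # safe examples passed / wrongly flagged

recall = tp / (tp + fn)                          # fraction of unsafe content caught
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)                             # safe content wrongly flagged
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"recall={recall:.1%} f1={f1:.1%} fpr={fpr:.1%} accuracy={accuracy:.1%}")
```

For a safety filter, recall is the headline number: a missed unsafe prompt (false negative) is usually costlier than a wrongly flagged safe one.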
## Training Data
- **Total samples**: 20,000 (balanced)
  - Safe: 10,000
  - Unsafe: 10,000
- **Sources**:
  - [AEGIS AI Content Safety Dataset 2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)
  - [MLCommons AI Safety Synth](https://huggingface.co/datasets/llm-semantic-router/mlcommons-ai-safety-synth)
## Model Architecture & Training
### Base Model
- **Model**: [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base)
- **Architecture**: ModernBERT (314M parameters)
### LoRA Configuration
| Parameter | Value |
|-----------|-------|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target Modules | `attn.Wqkv`, `attn.Wo`, `mlp.Wi`, `mlp.Wo` |
| Trainable Parameters | 6.76M (2.15%) |
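The trainable-parameter figure can be sanity-checked from the LoRA rank and the target-module shapes. The layer dimensions below are assumptions taken from the ModernBERT-base configuration (22 layers, hidden size 768, GLU feed-forward with a fused gate/value `Wi`); verify them against the actual model config before relying on this arithmetic:

```python
# Back-of-the-envelope check of the trainable LoRA parameter count.
# Shapes are ASSUMED from the ModernBERT-base config, not read from the model:
# 22 layers, hidden 768, GLU feed-forward (Wi 768->2304, Wo 1152->768).
r = 32
layers = 22
# (in_features, out_features) for each LoRA target module
targets = {
    "attn.Wqkv": (768, 2304),  # fused Q/K/V projection
    "attn.Wo":   (768, 768),
    "mlp.Wi":    (768, 2304),  # GLU: gate and value halves fused
    "mlp.Wo":    (1152, 768),
}
# LoRA adds two low-rank matrices per module: A (r x in) and B (out x r)
per_layer = sum(r * (d_in + d_out) for d_in, d_out in targets.values())
total = per_layer * layers
print(f"{total:,} trainable LoRA parameters (~{total / 1e6:.2f}M)")
# → 6,758,400 trainable LoRA parameters (~6.76M)
```

This lands on the ~6.76M figure in the table above, which is about 2.15% of the 314M base parameters.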
### Training Hyperparameters
| Parameter | Value |
|-----------|-------|
| Epochs | 10 |
| Batch Size | 64 |
| Learning Rate | 3e-4 |
| Optimizer | AdamW |
| Scheduler | Linear warmup |
## Hardware & Environment
| Component | Specification |
|-----------|---------------|
| GPU | AMD Instinct MI300X |
| VRAM | 192GB HBM3 |
| Platform | ROCm 6.2 |
| Container | `rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0` |
| Training Time | ~4 minutes |
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

# Load the base model and attach the LoRA adapter
base_model = "jhu-clsp/mmBERT-base"
adapter = "llm-semantic-router/mlcommons-safety-classifier-level1-binary"
tokenizer = AutoTokenizer.from_pretrained(adapter)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)
model = PeftModel.from_pretrained(model, adapter)
model.eval()

# Classify
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()
label = "safe" if prediction == 0 else "unsafe"
print(f"Classification: {label}")
```
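If a downstream filter needs a confidence score rather than a hard label, apply a softmax over the two logits (on the tensor itself, `outputs.logits.softmax(-1)` does this). A minimal pure-Python sketch of the same computation:

```python
import math

def unsafe_probability(logits):
    """Softmax over [safe_logit, unsafe_logit]; returns P(unsafe)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)

# Example with made-up logits: a strongly "safe" pair yields a low score
p = unsafe_probability([2.0, -1.0])
print(f"P(unsafe) = {p:.3f}")
```

Thresholding this probability (rather than taking the argmax) lets you trade false positives against recall to suit your moderation policy.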
## Label Mapping
```json
{
  "safe": 0,
  "unsafe": 1
}
```
## Intended Use
This model is designed for:
- Content moderation pipelines
- LLM input/output safety filtering
- Jailbreak and prompt injection detection
- First-stage filtering before detailed hazard classification
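The first-stage role can be sketched as a simple cascade: run this binary model on every input, and invoke the heavier Level 2 hazard classifier only on content flagged unsafe. The classifier arguments below are hypothetical stand-ins, not the real model APIs:

```python
def moderate(text, level1, level2):
    """Two-stage cascade: cheap binary gate, then detailed hazard typing.

    level1(text) -> "safe" | "unsafe"    (stand-in for this model)
    level2(text) -> hazard category name (stand-in for the Level 2 model)
    Both callables are hypothetical; wire in the real models in practice.
    """
    if level1(text) == "safe":
        return {"verdict": "safe"}
    return {"verdict": "unsafe", "hazard": level2(text)}

# Toy stand-ins for demonstration only
flag_words = {"exploit"}
l1 = lambda t: "unsafe" if flag_words & set(t.lower().split()) else "safe"
l2 = lambda t: "some_hazard_category"  # placeholder for the 9-class model
print(moderate("How do I make a cake?", l1, l2))
# → {'verdict': 'safe'}
```

Because most traffic is safe, the cheap binary gate keeps the expensive hazard classifier off the hot path.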
## Limitations
- Optimized for English but supports 1800+ languages via mmBERT
- Should be used as part of a broader safety system
- May require domain-specific fine-tuning for specialized applications
## Citation
```bibtex
@misc{mlcommons-safety-classifier,
title={MLCommons AI Safety Classifier},
author={LLM Semantic Router Team},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/llm-semantic-router/mlcommons-safety-classifier-level1-binary}
}
```
## License
Apache 2.0