---
license: apache-2.0
base_model: jhu-clsp/mmBERT-base
tags:
  - content-safety
  - text-classification
  - lora
  - peft
  - mlcommons
  - ai-safety
  - jailbreak-detection
  - moderation
datasets:
  - nvidia/Aegis-AI-Content-Safety-Dataset-2.0
  - llm-semantic-router/mlcommons-ai-safety-synth
language:
  - en
  - multilingual
metrics:
  - f1
  - recall
  - accuracy
pipeline_tag: text-classification
library_name: peft
---

# MLCommons AI Safety Classifier - Level 1 (Binary)

A LoRA-fine-tuned multilingual encoder model for binary content safety classification (safe/unsafe), following the MLCommons AI Safety Hazard Taxonomy.

## Model Description

This is Level 1 of a hierarchical safety classification system:
- **Level 1 (this model)**: Binary classification (safe vs unsafe)
- **Level 2**: 9-class hazard category classification

The model uses **mmBERT** (Multilingual ModernBERT) as the base, supporting 1800+ languages.
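
The two-level hierarchy described above can be sketched as a simple routing function, where only inputs flagged unsafe at Level 1 are passed to the Level 2 hazard classifier. The predictor callables below are illustrative placeholders, not APIs shipped with this model:

```python
def route(text, level1_predict, level2_predict):
    """Two-stage safety routing (sketch).

    level1_predict: callable returning "safe" or "unsafe" (this model).
    level2_predict: callable returning one of the 9 hazard categories (Level 2 model).
    Only texts judged unsafe incur the cost of the Level 2 classifier.
    """
    if level1_predict(text) == "safe":
        return {"label": "safe", "hazard": None}
    return {"label": "unsafe", "hazard": level2_predict(text)}
```

Running the cheap binary classifier first keeps latency low for the (typically large) fraction of benign traffic.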

## Training Results

| Metric | Value |
|--------|-------|
| **Recall** | 86.1% |
| **F1 Score** | 86.5% |
| **False Positive Rate** | 13.1% |
| **Accuracy** | 86.6% |

## Training Data

- **Total samples**: 20,000 (balanced)
  - Safe: 10,000
  - Unsafe: 10,000
- **Sources**:
  - [AEGIS AI Content Safety Dataset 2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)
  - [MLCommons AI Safety Synth](https://huggingface.co/datasets/llm-semantic-router/mlcommons-ai-safety-synth)

## Model Architecture & Training

### Base Model
- **Model**: [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base)
- **Architecture**: ModernBERT (314M parameters)

### LoRA Configuration
| Parameter | Value |
|-----------|-------|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target Modules | `attn.Wqkv`, `attn.Wo`, `mlp.Wi`, `mlp.Wo` |
| Trainable Parameters | 6.76M (2.15%) |
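
The table above corresponds roughly to the following `peft` configuration (a sketch mirroring the listed values, not the exact training script used for this model):

```python
from peft import LoraConfig, TaskType

# LoRA hyperparameters taken from the table above;
# target modules match mmBERT/ModernBERT attention and MLP projections.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["attn.Wqkv", "attn.Wo", "mlp.Wi", "mlp.Wo"],
)
```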

### Training Hyperparameters
| Parameter | Value |
|-----------|-------|
| Epochs | 10 |
| Batch Size | 64 |
| Learning Rate | 3e-4 |
| Optimizer | AdamW |
| Scheduler | Linear warmup |
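
These hyperparameters map onto `transformers.TrainingArguments` roughly as below. The `output_dir` and `warmup_ratio` values are assumptions for illustration; the card only states "linear warmup" without a ratio:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                 # assumed; not specified in the card
    num_train_epochs=10,
    per_device_train_batch_size=64,
    learning_rate=3e-4,
    optim="adamw_torch",
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                 # assumed; card only says "linear warmup"
)
```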

## Hardware & Environment

| Component | Specification |
|-----------|---------------|
| GPU | AMD Instinct MI300X |
| VRAM | 192GB HBM3 |
| Platform | ROCm 6.2 |
| Container | `rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0` |
| Training Time | ~4 minutes |

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load the base model, then apply the LoRA adapter
base_model = "jhu-clsp/mmBERT-base"
adapter = "llm-semantic-router/mlcommons-safety-classifier-level1-binary"
tokenizer = AutoTokenizer.from_pretrained(adapter)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)
model = PeftModel.from_pretrained(model, adapter)
model.eval()

# Classify
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()
label = "safe" if prediction == 0 else "unsafe"
print(f"Classification: {label}")
```

## Label Mapping

```json
{
  "safe": 0,
  "unsafe": 1
}
```
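
Applying the mapping above to raw logits can be done with a plain softmax; the helper below (pure Python, illustrative) returns the label together with a confidence score, which is useful when thresholding in a moderation pipeline:

```python
import math

ID2LABEL = {0: "safe", 1: "unsafe"}

def classify(logits):
    """Map a pair of raw logits to (label, confidence) via softmax."""
    # Subtract the max for numerical stability before exponentiating
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = probs.index(max(probs))
    return ID2LABEL[idx], probs[idx]
```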

## Intended Use

This model is designed for:
- Content moderation pipelines
- LLM input/output safety filtering
- Jailbreak and prompt injection detection
- First-stage filtering before detailed hazard classification

## Limitations

- Trained primarily on English data; inputs in the 1800+ languages covered by mmBERT are supported but may see reduced accuracy
- Should be used as part of a broader safety system
- May require domain-specific fine-tuning for specialized applications

## Citation

```bibtex
@misc{mlcommons-safety-classifier,
  title={MLCommons AI Safety Classifier},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/llm-semantic-router/mlcommons-safety-classifier-level1-binary}
}
```

## License

Apache 2.0