---
license: apache-2.0
base_model: jhu-clsp/mmBERT-base
tags:
  - content-safety
  - text-classification
  - lora
  - peft
  - mlcommons
  - ai-safety
  - jailbreak-detection
  - moderation
datasets:
  - nvidia/Aegis-AI-Content-Safety-Dataset-2.0
  - llm-semantic-router/mlcommons-ai-safety-synth
language:
  - en
  - multilingual
metrics:
  - f1
  - recall
  - accuracy
pipeline_tag: text-classification
library_name: peft
---

# MLCommons AI Safety Classifier - Level 1 (Binary)

A LoRA-fine-tuned multilingual encoder model for binary content safety classification (safe/unsafe), following the MLCommons AI Safety Hazard Taxonomy.

## Model Description

This is Level 1 of a hierarchical safety classification system:
- **Level 1 (this model)**: Binary classification (safe vs unsafe)
- **Level 2**: 9-class hazard category classification

The model uses **mmBERT** (Multilingual ModernBERT) as the base, supporting 1800+ languages.
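
The two-level hierarchy described above can be sketched as a simple routing function, where only inputs flagged unsafe at Level 1 are passed to the Level 2 hazard classifier. The predictor callables below are illustrative placeholders, not APIs shipped with this model:

```python
def route(text, level1_predict, level2_predict):
    """Two-stage safety routing (sketch).

    level1_predict: callable returning "safe" or "unsafe" (this model).
    level2_predict: callable returning one of the 9 hazard categories (Level 2 model).
    Only texts judged unsafe incur the cost of the Level 2 classifier.
    """
    if level1_predict(text) == "safe":
        return {"label": "safe", "hazard": None}
    return {"label": "unsafe", "hazard": level2_predict(text)}
```

Running the cheap binary classifier first keeps latency low for the (typically large) fraction of benign traffic.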

## Training Results

| Metric | Value |
|--------|-------|
| **Recall** | 86.1% |
| **F1 Score** | 86.5% |
| **False Positive Rate** | 13.1% |
| **Accuracy** | 86.6% |

## Training Data

- **Total samples**: 20,000 (balanced)
  - Safe: 10,000
  - Unsafe: 10,000
- **Sources**:
  - [AEGIS AI Content Safety Dataset 2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)
  - [MLCommons AI Safety Synth](https://huggingface.co/datasets/llm-semantic-router/mlcommons-ai-safety-synth)

## Model Architecture & Training

### Base Model
- **Model**: [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base)
- **Architecture**: ModernBERT (314M parameters)

### LoRA Configuration
| Parameter | Value |
|-----------|-------|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target Modules | `attn.Wqkv`, `attn.Wo`, `mlp.Wi`, `mlp.Wo` |
| Trainable Parameters | 6.76M (2.15%) |
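
The table above corresponds roughly to the following `peft` configuration (a sketch mirroring the listed values, not the exact training script used for this model):

```python
from peft import LoraConfig, TaskType

# LoRA hyperparameters taken from the table above;
# target modules match mmBERT/ModernBERT attention and MLP projections.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["attn.Wqkv", "attn.Wo", "mlp.Wi", "mlp.Wo"],
)
```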

### Training Hyperparameters
| Parameter | Value |
|-----------|-------|
| Epochs | 10 |
| Batch Size | 64 |
| Learning Rate | 3e-4 |
| Optimizer | AdamW |
| Scheduler | Linear warmup |
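
These hyperparameters map onto `transformers.TrainingArguments` roughly as below. The `output_dir` and `warmup_ratio` values are assumptions for illustration; the card only states "linear warmup" without a ratio:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                 # assumed; not specified in the card
    num_train_epochs=10,
    per_device_train_batch_size=64,
    learning_rate=3e-4,
    optim="adamw_torch",
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                 # assumed; card only says "linear warmup"
)
```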

## Hardware & Environment

| Component | Specification |
|-----------|---------------|
| GPU | AMD Instinct MI300X |
| VRAM | 192GB HBM3 |
| Platform | ROCm 6.2 |
| Container | `rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0` |
| Training Time | ~4 minutes |

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load the base model, then apply the LoRA adapter
base_model = "jhu-clsp/mmBERT-base"
adapter = "llm-semantic-router/mlcommons-safety-classifier-level1-binary"
tokenizer = AutoTokenizer.from_pretrained(adapter)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)
model = PeftModel.from_pretrained(model, adapter)
model.eval()

# Classify
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()
label = "safe" if prediction == 0 else "unsafe"
print(f"Classification: {label}")
```

## Label Mapping

```json
{
  "safe": 0,
  "unsafe": 1
}
```
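
Applying the mapping above to raw logits can be done with a plain softmax; the helper below (pure Python, illustrative) returns the label together with a confidence score, which is useful when thresholding in a moderation pipeline:

```python
import math

ID2LABEL = {0: "safe", 1: "unsafe"}

def classify(logits):
    """Map a pair of raw logits to (label, confidence) via softmax."""
    # Subtract the max for numerical stability before exponentiating
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = probs.index(max(probs))
    return ID2LABEL[idx], probs[idx]
```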

## Intended Use

This model is designed for:
- Content moderation pipelines
- LLM input/output safety filtering
- Jailbreak and prompt injection detection
- First-stage filtering before detailed hazard classification

## Limitations

- Trained primarily on English data; inputs in the 1800+ languages covered by mmBERT are supported but may see reduced accuracy
- Should be used as part of a broader safety system
- May require domain-specific fine-tuning for specialized applications

## Citation

```bibtex
@misc{mlcommons-safety-classifier,
  title={MLCommons AI Safety Classifier},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/llm-semantic-router/mlcommons-safety-classifier-level1-binary}
}
```

## License

Apache 2.0