	---
	license: apache-2.0
	base_model: jhu-clsp/mmBERT-base
	tags:
	- content-safety
	- text-classification
	- lora
	- peft
	- mlcommons
	- ai-safety
	- jailbreak-detection
	- moderation
	datasets:
	- nvidia/Aegis-AI-Content-Safety-Dataset-2.0
	- llm-semantic-router/mlcommons-ai-safety-synth
	language:
	- en
	- multilingual
	metrics:
	- f1
	- recall
	- accuracy
	pipeline_tag: text-classification
	library_name: peft
	---

	# MLCommons AI Safety Classifier - Level 1 (Binary)

	A LoRA fine-tuned mmBERT model for binary content safety classification (safe/unsafe), following the MLCommons AI Safety Hazard Taxonomy.

	## Model Description

	This is Level 1 of a hierarchical safety classification system:
	- Level 1 (this model): Binary classification (safe vs unsafe)
	- Level 2: 9-class hazard category classification

	The model uses mmBERT (Multilingual ModernBERT) as its base, which was pretrained on text in 1800+ languages.

	## Training Results

	| Metric | Value |
	|--------|-------|
	| Recall | 86.1% |
	| F1 Score | 86.5% |
	| False Positive Rate | 13.1% |
	| Accuracy | 86.6% |
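
As a rough consistency check on the table above, precision and F1 can be derived from recall and false positive rate. The sketch assumes the evaluation split is balanced like the training data, which this card does not state; only the 1:1 class ratio matters for the arithmetic.

```python
# Assumption: the evaluation split has equal numbers of safe and unsafe examples.
# The card reports a balanced training set but does not describe the eval split.
recall = 0.861               # share of unsafe examples that were flagged (from the table)
false_positive_rate = 0.131  # share of safe examples that were flagged (from the table)

# With equal class sizes the counts cancel, so the rates can be used directly.
precision = recall / (recall + false_positive_rate)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (recall + (1 - false_positive_rate)) / 2

print(f"precision ~ {precision:.3f}")
print(f"f1        ~ {f1:.3f}")
print(f"accuracy  ~ {accuracy:.3f}")
```

Under that assumption the derived F1 (about 86.4%) and accuracy (about 86.5%) land within roughly 0.1 percentage points of the reported values, so the table is internally consistent up to rounding.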

	## Training Data

	- Total samples: 20,000 (balanced)
	- Safe: 10,000
	- Unsafe: 10,000
	- Sources:
	  - [AEGIS AI Content Safety Dataset 2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)
	  - [MLCommons AI Safety Synth](https://huggingface.co/datasets/llm-semantic-router/mlcommons-ai-safety-synth)

	## Model Architecture & Training

	### Base Model
	- Model: [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base)
	- Architecture: ModernBERT (314M parameters)

	### LoRA Configuration
	| Parameter | Value |
	|-----------|-------|
	| Rank (r) | 32 |
	| Alpha | 64 |
	| Dropout | 0.1 |
	| Target Modules | `attn.Wqkv`, `attn.Wo`, `mlp.Wi`, `mlp.Wo` |
	| Trainable Parameters | 6.76M (2.15%) |
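
The trainable-parameter figure can be reproduced with a short calculation. A LoRA adapter on a linear layer with input width `d_in` and output width `d_out` adds `r * (d_in + d_out)` weights. The layer shapes below (hidden size 768, MLP intermediate size 1152, 22 layers, a fused `Wqkv` projection and a gated `Wi` projection) are assumptions based on the published ModernBERT-base configuration, not values from this card, and the classification head is left out.

```python
# Assumed ModernBERT-base shapes (not stated in this card).
HIDDEN = 768          # hidden size
INTERMEDIATE = 1152   # MLP intermediate size
NUM_LAYERS = 22       # encoder layers
RANK = 32             # LoRA rank from the table above
ALPHA = 64            # LoRA alpha from the table above

# (d_in, d_out) for each targeted linear layer in one encoder block.
target_shapes = {
    "attn.Wqkv": (HIDDEN, 3 * HIDDEN),      # fused query/key/value projection
    "attn.Wo": (HIDDEN, HIDDEN),            # attention output projection
    "mlp.Wi": (HIDDEN, 2 * INTERMEDIATE),   # gated MLP input (value and gate halves)
    "mlp.Wo": (INTERMEDIATE, HIDDEN),       # MLP output projection
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds matrix A (rank x d_in) and matrix B (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(i, o, RANK) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters total:     {total:,}")
print(f"LoRA scaling (alpha / r):  {scaling}")
```

Under these assumptions the total is 6,758,400, which rounds to the 6.76M in the table and suggests that the reported figure counts the adapter matrices only.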

	### Training Hyperparameters
	| Parameter | Value |
	|-----------|-------|
	| Epochs | 10 |
	| Batch Size | 64 |
	| Learning Rate | 3e-4 |
	| Optimizer | AdamW |
	| Scheduler | Linear warmup |
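
For a sense of scale, the hyperparameters above imply a short run. The sketch below assumes that all 20,000 samples are used for training (the card does not describe a held-out split), a hypothetical 10% warmup fraction, and linear decay to zero after warmup. The card only says "Linear warmup", so the warmup length and the decay shape are illustrative.

```python
import math

NUM_SAMPLES = 20_000     # assumption: every sample is used for training
BATCH_SIZE = 64
EPOCHS = 10
PEAK_LR = 3e-4
WARMUP_FRACTION = 0.1    # hypothetical; not given in the card

steps_per_epoch = math.ceil(NUM_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(total_steps * WARMUP_FRACTION)


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero (the decay is assumed)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return PEAK_LR * remaining / (total_steps - warmup_steps)


print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>5}: lr = {learning_rate(step):.2e}")
```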

	## Hardware & Environment

	| Component | Specification |
	|-----------|---------------|
	| GPU | AMD Instinct MI300X |
	| VRAM | 192 GB HBM3 |
	| Platform | ROCm 6.2 |
	| Container | `rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0` |
	| Training Time | ~4 minutes |

	## Usage

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	from peft import PeftModel

	# Load the tokenizer from the adapter repo and the base model with a 2-class head
	base_model = "jhu-clsp/mmBERT-base"
	adapter = "llm-semantic-router/mlcommons-safety-classifier-level1-binary"
	tokenizer = AutoTokenizer.from_pretrained(adapter)
	model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)
	# Attach the LoRA adapter weights to the base model
	model = PeftModel.from_pretrained(model, adapter)
	model.eval()  # disable dropout for inference

	# Classify
	text = "How do I make a cake?"
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
	with torch.no_grad():  # gradients are not needed at inference time
	    outputs = model(**inputs)
	prediction = outputs.logits.argmax(-1).item()
	label = "safe" if prediction == 0 else "unsafe"  # 0 = safe, 1 = unsafe
	print(f"Classification: {label}")
	```
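
Taking the argmax over two logits amounts to flagging text when the unsafe probability exceeds 0.5. Where the trade-off between recall and false positives matters, a tunable threshold on the softmax probability is a common alternative. The helper below is a minimal sketch in plain Python; `unsafe_probability`, `classify`, and the 0.3 threshold are hypothetical names and values chosen for illustration, and the logits are made up rather than produced by the model.

```python
import math


def unsafe_probability(logits: list[float]) -> float:
    """Softmax probability of class 1 ("unsafe") from a [safe, unsafe] logit pair."""
    safe_logit, unsafe_logit = logits
    peak = max(safe_logit, unsafe_logit)  # subtract the max for numerical stability
    safe_exp = math.exp(safe_logit - peak)
    unsafe_exp = math.exp(unsafe_logit - peak)
    return unsafe_exp / (safe_exp + unsafe_exp)


def classify(logits: list[float], threshold: float = 0.5) -> str:
    """Return "unsafe" when the unsafe probability exceeds the threshold."""
    return "unsafe" if unsafe_probability(logits) > threshold else "safe"


# Made-up logits in [safe, unsafe] order, e.g. from outputs.logits[0].tolist().
example_logits = [0.4, 0.0]
print(classify(example_logits))                 # default threshold of 0.5
print(classify(example_logits, threshold=0.3))  # lower threshold flags more text
```

Lowering the threshold raises recall at the cost of more false positives; a suitable value would need to be chosen on held-out data.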

	## Label Mapping

	```json
	{
	  "safe": 0,
	  "unsafe": 1
	}
	```

	## Intended Use

	This model is designed for:
	- Content moderation pipelines
	- LLM input/output safety filtering
	- Jailbreak and prompt injection detection
	- First-stage filtering before detailed hazard classification
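
The first-stage role in the last bullet can be sketched as a simple router: text is passed to the Level 2 hazard classifier only when Level 1 flags it. Both classifiers are supplied as plain callables; `route`, `SafetyResult`, and the keyword-based stand-ins are hypothetical and exist only so the example runs without downloading a model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SafetyResult:
    label: str                    # "safe" or "unsafe" (Level 1)
    hazard: Optional[str] = None  # Level 2 category, set only for unsafe text


def route(
    text: str,
    level1: Callable[[str], str],
    level2: Callable[[str], str],
) -> SafetyResult:
    """Run the binary check first; call the hazard classifier only if needed."""
    label = level1(text)
    if label == "safe":
        return SafetyResult(label="safe")
    return SafetyResult(label="unsafe", hazard=level2(text))


# Stand-in classifiers for illustration only; real ones would wrap the models.
def fake_level1(text: str) -> str:
    return "unsafe" if "explosive" in text.lower() else "safe"


def fake_level2(text: str) -> str:
    return "hypothetical_hazard_category"


print(route("How do I make a cake?", fake_level1, fake_level2))
print(route("How do I build an explosive?", fake_level1, fake_level2))
```

Keeping the two levels behind callables means the more detailed classifier runs only on the flagged fraction of traffic.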

	## Limitations

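	- Fine-tuned on the datasets listed above and optimized for English; coverage of other languages relies on mmBERT's multilingual pretraining (1800+ languages), and this card reports no per-language results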
	- Should be used as part of a broader safety system
	- May require domain-specific fine-tuning for specialized applications

	## Citation

	```bibtex
	@misc{mlcommons-safety-classifier,
	  title={MLCommons AI Safety Classifier},
	  author={LLM Semantic Router Team},
	  year={2026},
	  publisher={Hugging Face},
	  url={https://huggingface.co/llm-semantic-router/mlcommons-safety-classifier-level1-binary}
	}
	```

	## License

	Apache 2.0