# 🛡️ Guard Safety Classifier
A multi-task safety classifier based on DeBERTa-v3-small, trained on 3.9M+ samples for content moderation and safety detection.
## 🎯 Model Tasks

This model performs three simultaneous predictions:
1. **Binary Safety Classification** (`is_safe`)
   - ✅ Safe content
   - ⚠️ Unsafe content
2. **Single-Label Category Classification** (`category`)
   - Identifies the primary safety concern category
3. **Multi-Label Categories** (`categories`)
   - Can detect multiple safety issues simultaneously
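All three outputs come from a single forward pass over a shared encoder. Below is a minimal sketch of one way such a model can be built: a `deberta-v3-small` backbone whose `[CLS]` representation feeds three linear heads. The actual `MultiTaskSafetyClassifier` shipped alongside the weights may differ (pooling strategy, dropout rate, head design), so treat this as illustrative only.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class MultiTaskSafetyClassifier(nn.Module):
    """Shared DeBERTa encoder with three classification heads (sketch only)."""

    def __init__(self, model_name: str, num_categories: int, num_multi_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size  # 768 for deberta-v3-small
        self.dropout = nn.Dropout(0.1)  # assumed rate; not stated in the card
        self.is_safe_head = nn.Linear(hidden, 2)                  # binary safety logits
        self.category_head = nn.Linear(hidden, num_categories)    # single-label logits
        self.categories_head = nn.Linear(hidden, num_multi_labels)  # multi-label logits

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        out = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        pooled = self.dropout(out.last_hidden_state[:, 0])  # [CLS] token representation
        return {
            "is_safe": self.is_safe_head(pooled),
            "category": self.category_head(pooled),
            "categories": self.categories_head(pooled),
        }
```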
## 📊 Performance Metrics
| Metric | Score |
|---|---|
| `is_safe` Accuracy | 92.76% |
| `category` F1 | 0.5037 |
| `categories` F1 | 0.9068 |
| Test Loss | 1.0233 |
## 🚀 Quick Start
```python
import pickle

import torch
from transformers import AutoTokenizer

# Load the tokenizer
model_name = "YOUR_USERNAME/guard-safety-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model architecture (see the sketch under "Model Tasks" above)
from your_model_file import MultiTaskSafetyClassifier

NUM_CATEGORIES = 26    # size of the category list in this card
NUM_MULTI_LABELS = 28  # size of the multi-label class list in this card

model = MultiTaskSafetyClassifier(
    model_name="microsoft/deberta-v3-small",
    num_categories=NUM_CATEGORIES,
    num_multi_labels=NUM_MULTI_LABELS,
)

# Load the trained weights
model.load_state_dict(torch.load("model_weights.pt", map_location="cpu"))
model.eval()

# Load the label encoders
with open("label_encoders.pkl", "rb") as f:
    encoders = pickle.load(f)
le_category = encoders["le_category"]
mlb = encoders["mlb"]

# Inference
text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", max_length=128,
                   truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

is_safe = torch.softmax(outputs["is_safe"], dim=1)[0][1].item() > 0.5
category = le_category.inverse_transform([outputs["category"].argmax(1).item()])[0]
categories = mlb.inverse_transform(
    (torch.sigmoid(outputs["categories"]) > 0.5).cpu().numpy()
)[0]

print(f"Is Safe: {is_safe}")
print(f"Category: {category}")
print(f"Categories: {list(categories)}")
```
## 🏗️ Model Architecture

- Base Model: `microsoft/deberta-v3-small` (141M parameters)
- Hidden Size: 768
- Max Sequence Length: 128 tokens
- Training Framework: PyTorch + Transformers
## 📚 Training Details
- Dataset: budecosystem/guardrail-training-data
- Training Samples: 3,182,844
- Validation Samples: 397,855
- Test Samples: 397,856
- Batch Size: 64
- Learning Rate: 2e-5
- Epochs: 1
- Optimizer: AdamW with linear warmup
- Hardware: NVIDIA Tesla T4 (16GB)
- Training Time: ~8 hours
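For reproduction, the optimizer setup implied by these settings can be sketched as follows, using the `model` from the Quick Start. The step count follows from the figures above; the warmup ratio is an assumption, since the card does not state it.

```python
import math

import torch
from transformers import get_linear_schedule_with_warmup

# 3,182,844 training samples / batch size 64, for 1 epoch
num_training_steps = math.ceil(3_182_844 / 64)    # 49,732 optimizer steps
num_warmup_steps = int(0.1 * num_training_steps)  # 10% warmup: an assumed ratio

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)
```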
## 🏷️ Categories

The model can identify the following 26 safety categories:

```json
[
"animal_abuse",
"benign",
"child_abuse",
"code_vulnerabilities",
"controversial_topics_politics",
"cwe_compliance",
"dangerous_expert_advice",
"discrimination_stereotype_injustice",
"drug_abuse_weapons_banned_substance",
"financial_crime_property_crime_theft",
"fraud_deception_misinformation",
"gender_bias",
"hate_speech_offensive_language",
"jailbreak_prompt_injection",
"malware_hacking_cyberattack",
"misinformation_regarding_ethics_laws_and_safety",
"mitre_compliance",
"non_violent_unethical_behavior",
"orientation_bias",
"privacy_violation",
"race_bias",
"religious_bias",
"self_harm",
"sexually_explicit_adult_content",
"terrorism_organized_crime",
"violence_aiding_and_abetting_incitement"
]
```
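If `label_encoders.pkl` is ever unavailable, the single-label encoder can be reconstructed from this list. This assumes the original encoder was fit on exactly these 26 names; since scikit-learn's `LabelEncoder` stores classes in sorted order and the list above is already sorted, the resulting ID mapping comes out identical.

```python
from sklearn.preprocessing import LabelEncoder

CATEGORIES = [
    "animal_abuse", "benign", "child_abuse", "code_vulnerabilities",
    "controversial_topics_politics", "cwe_compliance", "dangerous_expert_advice",
    "discrimination_stereotype_injustice", "drug_abuse_weapons_banned_substance",
    "financial_crime_property_crime_theft", "fraud_deception_misinformation",
    "gender_bias", "hate_speech_offensive_language", "jailbreak_prompt_injection",
    "malware_hacking_cyberattack", "misinformation_regarding_ethics_laws_and_safety",
    "mitre_compliance", "non_violent_unethical_behavior", "orientation_bias",
    "privacy_violation", "race_bias", "religious_bias", "self_harm",
    "sexually_explicit_adult_content", "terrorism_organized_crime",
    "violence_aiding_and_abetting_incitement",
]

le_category = LabelEncoder().fit(CATEGORIES)  # IDs follow sorted order
```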
## 🔢 Multi-Label Classes

The multi-label head uses the following 28 classes:

```json
[
" ",
",",
"_",
"a",
"b",
"c",
"d",
"e",
"f",
"g",
"h",
"i",
"j",
"k",
"l",
"m",
"n",
"o",
"p",
"r",
"s",
"t",
"u",
"v",
"w",
"x",
"y",
"z"
]
```
## ⚙️ Configuration

The full model configuration is available in `config.json`.
## 📄 License
Apache 2.0
## 🙏 Acknowledgments
- Base model: microsoft/deberta-v3-small
- Training data: budecosystem/guardrail-training-data
## 📮 Contact
For questions or issues, please open an issue on the model repository.