Training dataset: budecosystem/guardrail-training-data
A multi-task safety classifier based on DeBERTa-v3-small trained on 3.9M+ samples for content moderation and safety detection.
This model performs three simultaneous predictions:
- Binary Safety Classification (`is_safe`)
- Single-Label Category Classification (`category`)
- Multi-Label Categories (`categories`)
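The three predictions can share one encoder and differ only in their output heads. The following is a minimal sketch of such a layout, not the released `MultiTaskSafetyClassifier`: the class name `MultiTaskHeads`, the function `multitask_loss`, and the equal weighting of the three losses are assumptions made for illustration. It takes an already pooled sentence embedding as input (768 dimensions for DeBERTa-v3-small) so that it runs without downloading the encoder.

```python
import torch
from torch import nn


class MultiTaskHeads(nn.Module):
    """Three linear heads over one shared sentence embedding (hypothetical layout)."""

    def __init__(self, hidden_size: int, num_categories: int, num_multi_labels: int):
        super().__init__()
        self.is_safe = nn.Linear(hidden_size, 2)                    # two logits, one per is_safe class
        self.category = nn.Linear(hidden_size, num_categories)      # single-label logits
        self.categories = nn.Linear(hidden_size, num_multi_labels)  # multi-label logits

    def forward(self, pooled: torch.Tensor) -> dict:
        # The keys mirror the ones read in the usage snippet of this card.
        return {
            "is_safe": self.is_safe(pooled),
            "category": self.category(pooled),
            "categories": self.categories(pooled),
        }


def multitask_loss(outputs, is_safe, category, categories):
    """Unweighted sum of the three task losses (the weighting is an assumption)."""
    ce = nn.functional.cross_entropy
    bce = nn.functional.binary_cross_entropy_with_logits
    return (
        ce(outputs["is_safe"], is_safe)
        + ce(outputs["category"], category)
        + bce(outputs["categories"], categories.float())
    )


# Toy batch of 4 random "pooled" embeddings; head sizes match the two label lists in this card.
heads = MultiTaskHeads(hidden_size=768, num_categories=26, num_multi_labels=28)
pooled = torch.randn(4, 768)
outputs = heads(pooled)
loss = multitask_loss(
    outputs,
    is_safe=torch.tensor([0, 1, 1, 0]),
    category=torch.tensor([1, 0, 5, 25]),
    categories=torch.randint(0, 2, (4, 28)),
)
```

Cross-entropy suits the two mutually exclusive tasks, while binary cross-entropy with logits treats each multi-label output as an independent yes/no decision, which matches the softmax/argmax and sigmoid decoding used at inference time.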
| Metric | Score |
|---|---|
| is_safe Accuracy | 92.76% |
| category F1 | 0.5037 |
| categories F1 | 0.9068 |
| Test Loss | 1.0233 |
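The table does not state how the F1 scores are averaged. As a general illustration, and not a reconstruction of the reported numbers, the sketch below computes macro- and micro-averaged F1 for a single-label task in plain Python. The helper `f1_scores` and the toy labels are hypothetical.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for a single-label classification task."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy data: the majority class is always right, two rare classes are always missed.
y_true = ["benign"] * 8 + ["self_harm", "race_bias"]
y_pred = ["benign"] * 10
macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.4f}, micro F1 = {micro:.4f}")
```

On imbalanced data the macro average weighs every class equally, so it can fall well below accuracy, whereas the micro average for a single-label task equals accuracy.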
import pickle

import torch
from transformers import AutoTokenizer

# MultiTaskSafetyClassifier is not bundled with transformers;
# replace your_model_file with the module that defines it
from your_model_file import MultiTaskSafetyClassifier

# Load tokenizer
model_name = "YOUR_USERNAME/guard-safety-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load label encoders (only unpickle files from a source you trust)
with open("label_encoders.pkl", "rb") as f:
    encoders = pickle.load(f)
le_category = encoders['le_category']
mlb = encoders['mlb']

# Head sizes, assumed here to equal the number of classes in each encoder
NUM_CATEGORIES = len(le_category.classes_)
NUM_MULTI_LABELS = len(mlb.classes_)

# Build the model architecture
model = MultiTaskSafetyClassifier(
    model_name="microsoft/deberta-v3-small",
    num_categories=NUM_CATEGORIES,
    num_multi_labels=NUM_MULTI_LABELS
)

# Load weights
model.load_state_dict(torch.load("model_weights.pt", map_location="cpu"))
model.eval()

# Inference
text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", max_length=128,
                   truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Class index 1 of the is_safe head is read as "safe"
is_safe = torch.softmax(outputs['is_safe'], dim=1)[0][1].item() > 0.5
category = le_category.inverse_transform([outputs['category'].argmax(1).item()])[0]
# Cast the boolean mask to 0/1 integers before decoding the multi-label output
multi_hot = (torch.sigmoid(outputs['categories']) > 0.5).int().cpu().numpy()
categories = mlb.inverse_transform(multi_hot)[0]

print(f"Is Safe: {is_safe}")
print(f"Category: {category}")
print(f"Categories: {list(categories)}")
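The snippet above decodes a single text. For batches, the same softmax, argmax, and sigmoid steps can be applied row by row. The helper below is a sketch under assumptions: `decode_outputs`, the `threshold` parameter, and the placeholder multi-label names `label_a` and `label_b` are hypothetical, and the logits are hand-written stand-ins for real model outputs.

```python
import torch


def decode_outputs(outputs, category_names, multi_label_names, threshold=0.5):
    """Turn a batch of raw logits into one result dict per input text."""
    # Index 1 of the is_safe head is treated as the "safe" class, as in the snippet above.
    safe_prob = torch.softmax(outputs["is_safe"], dim=1)[:, 1]
    cat_idx = outputs["category"].argmax(dim=1)
    multi_mask = torch.sigmoid(outputs["categories"]) > threshold

    results = []
    for i in range(safe_prob.shape[0]):
        results.append({
            "is_safe": bool(safe_prob[i] > threshold),
            "safe_probability": float(safe_prob[i]),
            "category": category_names[int(cat_idx[i])],
            "categories": [
                name
                for name, active in zip(multi_label_names, multi_mask[i].tolist())
                if active
            ],
        })
    return results


# Hand-written logits for a batch of two texts.
fake_outputs = {
    "is_safe": torch.tensor([[2.0, -2.0], [-1.0, 3.0]]),
    "category": torch.tensor([[2.0, 0.1, 0.3], [0.2, 1.5, 0.1]]),
    "categories": torch.tensor([[3.0, -3.0], [-2.0, -1.0]]),
}
results = decode_outputs(
    fake_outputs,
    category_names=["animal_abuse", "benign", "child_abuse"],
    multi_label_names=["label_a", "label_b"],
)
for row in results:
    print(row)
```

With the real encoders, `category_names` could be `list(le_category.classes_)` and `multi_label_names` could be `list(mlb.classes_)`.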
Base model: microsoft/deberta-v3-small (141M parameters)
The model can identify the following safety categories:
[
"animal_abuse",
"benign",
"child_abuse",
"code_vulnerabilities",
"controversial_topics_politics",
"cwe_compliance",
"dangerous_expert_advice",
"discrimination_stereotype_injustice",
"drug_abuse_weapons_banned_substance",
"financial_crime_property_crime_theft",
"fraud_deception_misinformation",
"gender_bias",
"hate_speech_offensive_language",
"jailbreak_prompt_injection",
"malware_hacking_cyberattack",
"misinformation_regarding_ethics_laws_and_safety",
"mitre_compliance",
"non_violent_unethical_behavior",
"orientation_bias",
"privacy_violation",
"race_bias",
"religious_bias",
"self_harm",
"sexually_explicit_adult_content",
"terrorism_organized_crime",
"violence_aiding_and_abetting_incitement"
]
Classes of the multi-label output (categories):
[
" ",
",",
"_",
"a",
"b",
"c",
"d",
"e",
"f",
"g",
"h",
"i",
"j",
"k",
"l",
"m",
"n",
"o",
"p",
"r",
"s",
"t",
"u",
"v",
"w",
"x",
"y",
"z"
]
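The list above contains single characters rather than category names. One way such a list can arise, offered here as an assumption and not confirmed by this card, is fitting scikit-learn's `MultiLabelBinarizer` on comma-joined strings: each string is iterated character by character. The sketch below reproduces the effect on hypothetical sample data and shows the split that yields whole labels.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical raw annotations: labels joined into one string per sample.
raw = ["self_harm, race_bias", "benign"]

# Fitting on plain strings treats every character as its own label.
mlb_chars = MultiLabelBinarizer().fit(raw)
char_classes = list(mlb_chars.classes_)

# Splitting each string into a list of names first yields whole labels.
split = [[name.strip() for name in row.split(",")] for row in raw]
mlb_labels = MultiLabelBinarizer().fit(split)
label_classes = list(mlb_labels.classes_)

print(char_classes)
print(label_classes)
```

If the encoder shipped with the model was fitted this way, `mlb.inverse_transform` returns characters, so the `categories` output would need to be interpreted with that in mind.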
Full model configuration is available in config.json
License: Apache 2.0
For questions or issues, please open an issue on the model repository.