🛡️ Guard Safety Classifier

A multi-task safety classifier based on DeBERTa-v3-small for content moderation and safety detection, built on a dataset of about 3.98M samples (3.18M of them in the training split).

🎯 Model Tasks

This model performs three simultaneous predictions:

1. Binary Safety Classification (is_safe)
   - ✅ Safe content
   - ⚠️ Unsafe content
2. Single-Label Category Classification (category)
   - Identifies the primary safety concern category
3. Multi-Label Categories (categories)
   - Can detect multiple safety issues simultaneously

📊 Performance Metrics

Metric	Score
is_safe Accuracy	92.76%
category F1	0.5037
categories F1	0.9068
Test Loss	1.0233
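
The table does not state how the F1 scores are averaged. As a reminder of why that matters when comparing numbers, here is a minimal sketch that computes accuracy, macro-F1 and micro-F1 on a tiny made-up label set. The data and the helper names (`f1_scores`, `accuracy`) are illustrative assumptions and are not taken from this model's evaluation code.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 values, every class counts equally.
    macro = sum(f1(tp[lab], fp[lab], fn[lab]) for lab in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up toy data: one frequent class and two rare ones.
y_true = ["benign"] * 8 + ["self_harm", "privacy_violation"]
y_pred = ["benign"] * 8 + ["benign", "privacy_violation"]

macro_f1, micro_f1 = f1_scores(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(f"accuracy={acc:.4f} macro_f1={macro_f1:.4f} micro_f1={micro_f1:.4f}")
```

On imbalanced data like this, a single missed rare class pulls macro-F1 well below micro-F1, so the two should not be compared directly.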

🚀 Quick Start

import pickle

import torch
from transformers import AutoTokenizer

# Placeholder: replace with the module that defines the model class
from your_model_file import MultiTaskSafetyClassifier

# Load tokenizer
model_name = "YOUR_USERNAME/guard-safety-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load label encoders first so the head sizes can be read from them
# (only unpickle files from a source you trust)
with open("label_encoders.pkl", "rb") as f:
    encoders = pickle.load(f)
le_category = encoders['le_category']
mlb = encoders['mlb']

# Build the model architecture
# Assumption: the encoders are scikit-learn style objects exposing classes_
model = MultiTaskSafetyClassifier(
    model_name="microsoft/deberta-v3-small",
    num_categories=len(le_category.classes_),
    num_multi_labels=len(mlb.classes_)
)

# Load weights on CPU and switch to inference mode
model.load_state_dict(torch.load("model_weights.pt", map_location="cpu"))
model.eval()

# Inference
text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", max_length=128,
                   truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Index 1 of the is_safe logits is treated as the "safe" class
is_safe = torch.softmax(outputs['is_safe'], dim=1)[0][1].item() > 0.5
# Single-label head: take the highest-scoring category
category = le_category.inverse_transform([outputs['category'].argmax(1).item()])[0]
# Multi-label head: keep every class whose sigmoid score exceeds 0.5
multi_hot = (torch.sigmoid(outputs['categories']) > 0.5).int().cpu().numpy()
categories = mlb.inverse_transform(multi_hot)[0]

print(f"Is Safe: {is_safe}")
print(f"Category: {category}")
print(f"Categories: {list(categories)}")

🏗️ Model Architecture

Base Model: microsoft/deberta-v3-small (141M parameters)
Hidden Size: 768
Max Sequence Length: 128 tokens
Training Framework: PyTorch + Transformers
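
The definition of MultiTaskSafetyClassifier is not included here. The following is a minimal sketch of what the three output heads could look like, assuming a pooled encoder vector of size 768 feeds three independent linear layers and that the output dictionary uses the keys seen in Quick Start (is_safe, category, categories). The class name `MultiTaskHeads`, the dropout value and the random stand-in for the encoder output are assumptions made for illustration. The default head sizes follow the lengths of the two label lists in this card.

```python
import torch
from torch import nn


class MultiTaskHeads(nn.Module):
    """Hypothetical classification heads on top of a pooled encoder vector."""

    def __init__(self, hidden_size=768, num_categories=26,
                 num_multi_labels=28, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.is_safe = nn.Linear(hidden_size, 2)                 # two-way logits
        self.category = nn.Linear(hidden_size, num_categories)   # single-label logits
        self.categories = nn.Linear(hidden_size, num_multi_labels)  # multi-label logits

    def forward(self, pooled):
        x = self.dropout(pooled)
        return {
            "is_safe": self.is_safe(x),
            "category": self.category(x),
            "categories": self.categories(x),
        }


heads = MultiTaskHeads().eval()
pooled = torch.randn(4, 768)  # stand-in for the encoder's pooled output
with torch.no_grad():
    out = heads(pooled)

print({name: tuple(logits.shape) for name, logits in out.items()})
```

Sharing one encoder across the three heads means a single forward pass produces all three predictions; how the real model pools the encoder output is not stated in this card.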

📚 Training Details

Dataset: budecosystem/guardrail-training-data
Training Samples: 3,182,844
Validation Samples: 397,855
Test Samples: 397,856
Batch Size: 64
Learning Rate: 2e-5
Epochs: 1
Optimizer: AdamW with linear warmup
Hardware: NVIDIA Tesla T4 (16GB)
Training Time: ~8 hours
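
As a rough check on these figures, one epoch over 3,182,844 samples at batch size 64 comes to about 49.7k optimizer steps, assuming no gradient accumulation and that the final partial batch is kept. The sketch below computes that count and a linear-warmup learning-rate schedule. The 10% warmup fraction and the linear decay after warmup are assumptions: the details above only say "AdamW with linear warmup".

```python
import math

TRAIN_SAMPLES = 3_182_844
BATCH_SIZE = 64
EPOCHS = 1
PEAK_LR = 2e-5
WARMUP_FRACTION = 0.1  # assumption: not stated in the training details

total_steps = math.ceil(TRAIN_SAMPLES / BATCH_SIZE) * EPOCHS
warmup_steps = int(total_steps * WARMUP_FRACTION)


def lr_at(step):
    """Linear warmup to PEAK_LR, then linear decay to zero (decay is assumed)."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - warmup_steps)


print("total steps:", total_steps)
print("warmup steps:", warmup_steps)
print("lr at start, peak, end:", lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```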

🏷️ Categories

The model can identify the following safety categories:

[
  "animal_abuse",
  "benign",
  "child_abuse",
  "code_vulnerabilities",
  "controversial_topics_politics",
  "cwe_compliance",
  "dangerous_expert_advice",
  "discrimination_stereotype_injustice",
  "drug_abuse_weapons_banned_substance",
  "financial_crime_property_crime_theft",
  "fraud_deception_misinformation",
  "gender_bias",
  "hate_speech_offensive_language",
  "jailbreak_prompt_injection",
  "malware_hacking_cyberattack",
  "misinformation_regarding_ethics_laws_and_safety",
  "mitre_compliance",
  "non_violent_unethical_behavior",
  "orientation_bias",
  "privacy_violation",
  "race_bias",
  "religious_bias",
  "self_harm",
  "sexually_explicit_adult_content",
  "terrorism_organized_crime",
  "violence_aiding_and_abetting_incitement"
]

🔢 Multi-Label Classes

[
  " ",
  ",",
  "_",
  "a",
  "b",
  "c",
  "d",
  "e",
  "f",
  "g",
  "h",
  "i",
  "j",
  "k",
  "l",
  "m",
  "n",
  "o",
  "p",
  "r",
  "s",
  "t",
  "u",
  "v",
  "w",
  "x",
  "y",
  "z"
]
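
The entries above are single characters rather than category names. One plausible explanation (an assumption; this card does not describe how the multi-label encoder was fit) is that it was fit on raw comma-separated strings, so each string was iterated character by character instead of label by label. The sketch below reproduces that effect in plain Python. `collect_classes` is a hypothetical helper that gathers the sorted set of all elements seen across samples, and the rows are made up.

```python
def collect_classes(samples):
    """Sorted set of every element found by iterating each sample."""
    return sorted({item for sample in samples for item in sample})


raw = ["benign", "self_harm, privacy_violation"]  # made-up rows

# Iterating a string yields characters, so the "classes" become characters.
char_classes = collect_classes(raw)

# Splitting each row into a list of names first yields label names instead.
split_rows = [[name.strip() for name in row.split(",")] for row in raw]
label_classes = collect_classes(split_rows)

print(char_classes)
print(label_classes)
```

If this is what happened, the categories output in Quick Start would decode to characters, and the multi-label head would need to be retrained with list-valued labels to return category names.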

⚙️ Configuration

Full model configuration is available in config.json.

📄 License

Apache 2.0

🙏 Acknowledgments

Base model: microsoft/deberta-v3-small
Training data: budecosystem/guardrail-training-data

📮 Contact

For questions or issues, please open an issue on the model repository.

jainsatyam26/guard-safety-classifier