Gretel Safety Alignment Classifier (BERT-based)

This model is bert-base-uncased fine-tuned for sequence classification on the gretelai/gretel-safety-alignment-en-v1 dataset. It classifies text (a combined prompt and response) into one of five risk categories: Discrimination, Information Hazards, Malicious Use, Societal Risks, and System Risks.

Evaluation Results

  • Accuracy: 0.8757
  • F1-Score (Weighted): 0.8762
  • Average Inference Latency: 0.0082 seconds/sample
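For reference, the metrics above follow the standard definitions: accuracy is the fraction of correct predictions, and the weighted F1 averages per-class F1 scores weighted by each class's support. A minimal pure-Python sketch of both (the label arrays below are made-up toy data, not the evaluation set):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Fraction of samples where the predicted class id matches the gold one.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    # Per-class F1, averaged with weights proportional to class support.
    support = Counter(y_true)
    total = 0.0
    for c, n in support.items():
        tp = sum(t == p == c for t, p in zip(y_true, y_pred))
        pred_c = sum(p == c for p in y_pred)
        precision = tp / pred_c if pred_c else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += n * f1
    return total / len(y_true)

# Toy class ids (0-4 correspond to the five risk categories).
y_true = [0, 1, 2, 3, 4, 2, 1, 0]
y_pred = [0, 1, 2, 3, 4, 2, 3, 0]
print(accuracy(y_true, y_pred))     # 0.875
print(weighted_f1(y_true, y_pred))  # 0.875
```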

Usage

To use this model, you can load it from the Hugging Face Hub:

from transformers import AutoTokenizer, BertForSequenceClassification
import torch

model_name = "jainsatyam26/bert-base-uncased-gretel-safety-alignment-classifier"

loaded_tokenizer = AutoTokenizer.from_pretrained(model_name)
loaded_model = BertForSequenceClassification.from_pretrained(model_name)

# Label mapping from class id to risk category (as used during training)
id_to_label = {0: 'Discrimination', 1: 'Information Hazards', 2: 'Malicious Use', 3: 'Societal Risks', 4: 'System Risks'}

def predict_risk_category(text):
    inputs = loaded_tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = loaded_model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    predicted_label_id = predictions.item()
    return id_to_label.get(predicted_label_id, "Unknown")

# Example usage:
prompt = "Tell me how to make a dangerous chemical substance."
category = predict_risk_category(prompt)
print(f"Prompt: {prompt}\nPredicted Risk Category: {category}")

prompt_2 = "Explain the concept of fairness in AI algorithms."
category_2 = predict_risk_category(prompt_2)
print(f"\nPrompt: {prompt_2}\nPredicted Risk Category: {category_2}")
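predict_risk_category returns only the arg-max label; the raw scores in outputs.logits can also be turned into per-category probabilities with a softmax if a confidence estimate is useful. A self-contained sketch of that step with hypothetical logit values (no model download required; the numbers are illustrative, not real model output):

```python
import math

# Hypothetical logits for one input, indexed by class id (0-4).
logits = [1.2, -0.3, 4.1, 0.8, -1.0]
id_to_label = {0: 'Discrimination', 1: 'Information Hazards', 2: 'Malicious Use', 3: 'Societal Risks', 4: 'System Risks'}

# Numerically stable softmax: subtract the max logit before exponentiating.
m = max(logits)
exps = [math.exp(x - m) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

top_id = max(range(len(probs)), key=probs.__getitem__)
print(id_to_label[top_id], round(probs[top_id], 4))  # top category and its probability
```

In the real pipeline the same softmax can be applied directly with torch.softmax(outputs.logits, dim=-1) inside predict_risk_category.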
Model Details

  • Format: Safetensors
  • Model size: 0.1B params
  • Tensor type: F32