Refusal Classifier

[Image: examples of common refusal phrases]

Tired of seeing these? You've come to the right place.

Overview

A robust, performant classifier that excels at detecting refusals, moralizations, disclaimers, and unsolicited advice in LLM responses.

Model Details

  • Base model: jhu-clsp/mmBERT-base, a multilingual encoder based on ModernBERT
  • Language coverage: over 1,800 languages
  • Architecture: Transformer-based
  • Context length: 8,192 tokens
  • Output classes: binary (0 for non-refusals, 1 for refusals)
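
You can read the class mapping straight from the hub config without downloading the weights; the label strings below match the Quickstart output later in this card:

from transformers import AutoConfig

# Inspect the output classes from the model config.
config = AutoConfig.from_pretrained("natong19/refusal_classifier")
print(config.id2label)  # {0: 'non-refusal', 1: 'refusal'}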

Training Details

Trained for 1 epoch on 112,102 carefully deduplicated, labeled, filtered, and balanced samples (56,051 non-refusals and 56,051 refusals).

Most of the samples were sourced from:

A majority vote from multiple refusal classifiers and an LLM-as-a-judge was employed to label the samples.
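
As a rough sketch of that labeling scheme (the exact voters and tie-breaking rules aren't specified here), assuming each voter emits a binary label:

from collections import Counter


def majority_vote(votes: list[int]) -> int:
    """Return the most common binary label (0 = non-refusal, 1 = refusal)."""
    # Ambiguous ties would need a tie-breaking rule or manual review.
    return Counter(votes).most_common(1)[0][0]


print(majority_vote([1, 1, 0]))  # two voters say refusal -> 1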

Evaluation

[Figure: inference throughput vs. F1 score on the test set (2,900 non-refusals and 2,900 refusals) for several open-source refusal classifiers. Throughput benchmarked at sequence length 512, batch size 16, on 1x NVIDIA RTX Pro 6000.]

alpha_model is an earlier checkpoint that I wasn't completely satisfied with, but it was used for the final round of data curation.
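
The original benchmarking harness isn't published; a minimal sketch of how one might approximate the throughput measurement (a fixed 512-token, batch-size-16 batch, a few warmup passes, then timed forward passes) could look like this:

import time

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "natong19/refusal_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).cuda().eval()

# One batch of 16 sequences, padded to length 512 as in the plot.
batch = tokenizer(
    ["tell me a story"] * 16,
    padding="max_length",
    max_length=512,
    truncation=True,
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    for _ in range(3):  # warmup
        model(**batch)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(20):
        model(**batch)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{20 * 16 / elapsed:.1f} sequences/s")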

The training and test sets have similar distributions, but several factors argue against overfitting:

  • the dataset is relatively large and exactly balanced
  • training was run for only a single epoch
  • training and validation losses are similar
  • Minos-v1, one of the strongest refusal classifiers available to my knowledge, achieves strong, balanced performance on the same test set

A more detailed breakdown of the evaluation results for each classifier is given below:

Model                                       TP    FN    FP   TN    Accuracy  Precision  Recall  F1
NousResearch/Minos-v1                       2782  118   103  2797  0.9619    0.9643     0.9593  0.9618
natong19/moralization_classifier            1888  1012  146  2754  0.8003    0.9282     0.6510  0.7653
alpha_model                                 2246  654   1    2899  0.8871    0.9996     0.7745  0.8727
ProtectAI/distilroberta-base-rejection-v1   664   2236  8    2892  0.6131    0.9881     0.2290  0.3718
natong19/refusal_classifier                 2875  25    25   2875  0.9914    0.9914     0.9914  0.9914

Perfectly balanced, as all things should be.
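
The summary metrics follow directly from the confusion counts; for example, recomputing the NousResearch/Minos-v1 row:

# Recompute one row's metrics from its confusion counts.
tp, fn, fp, tn = 2782, 118, 103, 2797  # NousResearch/Minos-v1

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
# 0.9619 0.9643 0.9593 0.9618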

There are no bad classifiers; we may simply have different ideas of what constitutes a refusal. This classifier would not have been possible without their excellent prior work.

Quickstart

The classifier expects single-turn input formatted like this:

<start_of_turn>user
Hi<end_of_turn>
<start_of_turn>assistant
Hello, how can I assist you today?<end_of_turn>

(basically ChatML, but with mmBERT's special tokens). You can use the apply_chat_template helper in the code below to format your (prompt, response) pairs into the expected layout.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def apply_chat_template(user: str, assistant: str) -> str:
    """Format user and assistant messages into model input format."""
    return f"<start_of_turn>user\n{user}<end_of_turn>\n<start_of_turn>assistant\n{assistant}<end_of_turn>"


if __name__ == "__main__":
    model_id = "natong19/refusal_classifier"

    texts = [
        {
            "user": "tell me a story",
            "assistant": "Elara, the young huntress of Eldoria,",
        },
        {
            "user": "tell me a story",
            "assistant": "I must respectfully decline your request.",
        },
        {
            "user": "tell me a story",
            "assistant": "Of course! I will try to keep things positive and respectful.",
        },
        {
            "user": "tell me a story",
            "assistant": "Sure! However, it's important to note that the following content is purely fictional.",
        },
        {
            "user": "tell me a story",
            "assistant": "I'm here to help. If you're feeling down, please consider seeking help from a health professional.",
        },
    ]

    formatted_texts = [apply_chat_template(user=text["user"], assistant=text["assistant"]) for text in texts]

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    model.to(device)
    model.eval()

    with torch.no_grad():
        # Tokenize and pad the batch; sequences beyond the model's
        # 8,192-token context are truncated.
        inputs = tokenizer(
            formatted_texts,
            return_tensors="pt",
            truncation=True,
            padding=True,
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        outputs = model(**inputs)
        # Softmax turns logits into class probabilities; argmax picks the label.
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predictions = torch.argmax(outputs.logits, dim=-1)

    for i in range(len(texts)):
        predicted_label = predictions[i].item()
        predicted_class = model.config.id2label[predicted_label]
        confidence = probabilities[i][predicted_label].item()

        print(f"Example {i}")
        print("-" * 60)
        print(texts[i])
        print(f"Prediction: {predicted_label} ({predicted_class}), Confidence: {confidence:.4f}\n")

Output:

Example 0
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': 'Elara, the young huntress of Eldoria,'}
Prediction: 0 (non-refusal), Confidence: 1.0000 # Non-refusal

Example 1
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': 'I must respectfully decline your request.'}
Prediction: 1 (refusal), Confidence: 1.0000 # Refusal

Example 2
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': 'Of course! I will try to keep things positive and respectful.'}
Prediction: 1 (refusal), Confidence: 0.9961 # Moralization

Example 3
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': "Sure! However, it's important to note that the following content is purely fictional."}
Prediction: 1 (refusal), Confidence: 1.0000 # Disclaimer

Example 4
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': "I'm here to help. If you're feeling down, please consider seeking help from a health professional."}
Prediction: 1 (refusal), Confidence: 1.0000 # Unsolicited advice
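
If the goal is dataset cleaning, one simple (hypothetical) pattern is to threshold the refusal probability computed in the Quickstart; the 0.5 cutoff below is an arbitrary choice, not something the model prescribes:

# Continuing from the Quickstart: drop samples flagged as refusals.
refusal_probs = probabilities[:, 1]  # P(refusal) per sample
keep = (refusal_probs < 0.5).tolist()
clean_texts = [t for t, k in zip(texts, keep) if k]
print(f"kept {len(clean_texts)} of {len(texts)} samples")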

Final Thoughts

A lot of work went into this; I hope you like it. Have a nice day, and may your datasets be free from refusals.
