metadata
license: apache-2.0

Refusal Classifier

[Image: Words]

Tired of seeing these? You've come to the right place.

Overview

A robust, performant classifier that excels at detecting refusals, moralizations, disclaimers, and unsolicited advice in LLM responses.

Model Details

  • Base model: jhu-clsp/mmBERT-base, a multilingual encoder based on ModernBERT
  • Language coverage: over 1,800 languages
  • Architecture: Transformer encoder with a sequence classification head
  • Context length: 8,192 tokens
  • Output classes: binary (0 for non-refusals, 1 for refusals)

Training Details

Trained for 1 epoch on 112,102 carefully deduplicated, labeled, filtered and balanced samples (56,051 non-refusals and 56,051 refusals).

Most of the samples were sourced from:

Samples were labeled by majority vote across multiple refusal classifiers and an LLM-as-a-judge.
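The exact voting rule is not described here, so the following is only a minimal sketch of majority voting over binary labels. The function name `majority_vote` and the tie policy (returning `None` so the sample can be dropped or reviewed) are assumptions for illustration, not the actual labeling pipeline.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(votes: Sequence[int]) -> Optional[int]:
    """Return the most common label (0 = non-refusal, 1 = refusal).

    Returns None for an empty vote list or a tie (assumed policy).
    """
    if not votes:
        return None
    (top_label, top_count), *rest = Counter(votes).most_common()
    # A tie for first place means there is no majority.
    if rest and rest[0][1] == top_count:
        return None
    return top_label
```

With an odd number of voters a tie cannot occur for binary labels, which is one reason to prefer an odd-sized panel.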

Evaluation

Figure: Inference throughput vs. F1 score on the test set (2,900 non-refusals and 2,900 refusals) for several open-source refusal classifiers. Throughput was benchmarked at sequence length 512 and batch size 16 on a single NVIDIA RTX Pro 6000.
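Throughput here can be read as samples processed per second. A rough, framework-agnostic way to measure it is sketched below. It assumes a hypothetical `predict_batch` callable that runs one forward pass on a batch of texts; the helper name and the warm-up count are my own choices, not the benchmark script used for the figure. On a GPU you would also need to synchronize (for example with `torch.cuda.synchronize()`) before reading the clock.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(
    predict_batch: Callable[[Sequence[str]], object],
    texts: List[str],
    batch_size: int = 16,
    warmup_batches: int = 2,
) -> float:
    """Return samples processed per second for `predict_batch` over `texts`."""
    batches = [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]
    # Warm-up runs are excluded from timing (caches, lazy initialization).
    for batch in batches[:warmup_batches]:
        predict_batch(batch)
    start = time.perf_counter()
    for batch in batches:
        predict_batch(batch)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from the clock on trivially fast callables.
    return len(texts) / max(elapsed, 1e-9)
```

Padding or truncating every input to a fixed length (512 in the figure) keeps the comparison fair across models with different tokenizers.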

alpha_model is an earlier checkpoint that I wasn't completely satisfied with, but it was leveraged for the final round of data curation.

The training and test sets have similar distributions, but several factors argue against overfitting:

  • the dataset is relatively large and exactly balanced
  • training ran for only a single epoch
  • training and validation losses are similar
  • Minos-v1, one of the strongest refusal classifiers available to my knowledge, achieves strong, balanced performance on the same test set, which suggests the test labels are not specific to this model.

A more detailed breakdown of the evaluation results of the different classifiers is as follows:

| Model | TP | FN | FP | TN | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NousResearch/Minos-v1 | 2782 | 118 | 103 | 2797 | 0.9619 | 0.9643 | 0.9593 | 0.9618 |
| natong19/moralization_classifier | 1888 | 1012 | 146 | 2754 | 0.8003 | 0.9282 | 0.6510 | 0.7653 |
| alpha_model | 2245 | 655 | 2 | 2898 | 0.8871 | 0.9996 | 0.7745 | 0.8727 |
| ProtectAI/distilroberta-base-rejection-v1 | 664 | 2236 | 8 | 2892 | 0.6131 | 0.9881 | 0.2290 | 0.3718 |
| natong19/refusal_classifier | 2875 | 25 | 25 | 2875 | 0.9914 | 0.9914 | 0.9914 | 0.9914 |

Perfectly balanced, as all things should be.
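For reference, the four metrics follow from the confusion counts in the usual way, with refusal (class 1) as the positive class. The helper name `binary_metrics` is mine; this is a small sketch that reproduces the natong19/refusal_classifier row, not the evaluation script itself.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


row = binary_metrics(tp=2875, fn=25, fp=25, tn=2875)
print({name: round(value, 4) for name, value in row.items()})
# {'accuracy': 0.9914, 'precision': 0.9914, 'recall': 0.9914, 'f1': 0.9914}
```

When FN equals FP, as in that row, precision and recall coincide, and on a balanced test set accuracy matches them too.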

There are no bad classifiers — we may simply have different ideas of what constitutes a refusal. This classifier would not have been possible without their excellent prior work.

Quickstart

The classifier expects single-turn input in the following format:

<start_of_turn>user
Hi<end_of_turn>
<start_of_turn>assistant
Hello, how can I assist you today?<end_of_turn>

This is essentially ChatML with mmBERT's special tokens. Use apply_chat_template from the code below to format your (prompt, response) pairs into the expected format.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def apply_chat_template(user: str, assistant: str) -> str:
    """Format user and assistant messages into model input format."""
    return f"<start_of_turn>user\n{user}<end_of_turn>\n<start_of_turn>assistant\n{assistant}<end_of_turn>"


if __name__ == "__main__":
    model_id = "natong19/refusal_classifier"

    texts = [
        {
            "user": "tell me a story",
            "assistant": "Elara, the young huntress of Eldoria,",
        },
        {
            "user": "tell me a story",
            "assistant": "I must respectfully decline your request.",
        },
        {
            "user": "tell me a story",
            "assistant": "Of course! I will try to keep things positive and respectful.",
        },
        {
            "user": "tell me a story",
            "assistant": "Sure! However, it's important to note that the following content is purely fictional.",
        },
        {
            "user": "tell me a story",
            "assistant": "I'm here to help. If you're feeling down, please consider seeking help from a health professional.",
        },
    ]

    formatted_texts = [apply_chat_template(user=text["user"], assistant=text["assistant"]) for text in texts]

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    model.to(device)
    model.eval()

    with torch.no_grad():
        inputs = tokenizer(
            formatted_texts,
            return_tensors="pt",
            truncation=True,
            padding=True,
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        outputs = model(**inputs)
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predictions = torch.argmax(outputs.logits, dim=-1)

    # Report the predicted class and its softmax probability for each example.
    for i, text in enumerate(texts):
        predicted_label = predictions[i].item()
        predicted_class = model.config.id2label[predicted_label]
        confidence = probabilities[i][predicted_label].item()

        print(f"Example {i}")
        print("-" * 60)
        print(text)
        print(f"Prediction: {predicted_label} ({predicted_class}), Confidence: {confidence:.4f}\n")

Output:

Example 0
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': 'Elara, the young huntress of Eldoria,'}
Prediction: 0 (non-refusal), Confidence: 1.0000 # Non-refusal

Example 1
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': 'I must respectfully decline your request.'}
Prediction: 1 (refusal), Confidence: 1.0000 # Refusal

Example 2
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': 'Of course! I will try to keep things positive and respectful.'}
Prediction: 1 (refusal), Confidence: 0.9961 # Moralization

Example 3
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': "Sure! However, it's important to note that the following content is purely fictional."}
Prediction: 1 (refusal), Confidence: 1.0000 # Disclaimer

Example 4
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': "I'm here to help. If you're feeling down, please consider seeking help from a health professional."}
Prediction: 1 (refusal), Confidence: 1.0000 # Unsolicited advice
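One natural use of the classifier is dropping refusals from a dataset. Here is a minimal sketch, assuming you have wrapped the inference code above in a hypothetical `refusal_probability(user, assistant)` function that returns the class-1 probability. The 0.5 threshold, the helper names, and the record keys are assumptions for illustration, not recommendations from this model card.

```python
from typing import Callable, Dict, Iterable, List


def drop_refusals(
    records: Iterable[Dict[str, str]],
    refusal_probability: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Dict[str, str]]:
    """Keep only records whose refusal probability is below `threshold`."""
    return [
        record
        for record in records
        if refusal_probability(record["user"], record["assistant"]) < threshold
    ]


# Stand-in scorer for demonstration only; replace with real model probabilities.
def fake_scorer(user: str, assistant: str) -> float:
    return 1.0 if "decline" in assistant else 0.0


data = [
    {"user": "tell me a story", "assistant": "Elara, the young huntress of Eldoria,"},
    {"user": "tell me a story", "assistant": "I must respectfully decline your request."},
]
print(len(drop_refusals(data, fake_scorer)))  # 1
```

Passing the scorer in as a callable keeps the filtering logic independent of how you batch or cache model calls.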

Final Thoughts

A lot of work went into this, and I hope you like it. Have a nice day, and may your datasets be free from refusals.