---
license: apache-2.0
---

# Refusal Classifier

<div align="left">
<img src="figures/words.png" width="60%" alt="Words"/>
</div>

*Tired of seeing these? You've come to the right place.*

## Overview

A robust and performant classifier that excels at **detecting refusals, moralizations, disclaimers, unsolicited advice**, and the like.

### Model Details

- Base model: [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base), a multilingual encoder based on [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base)
- Language coverage: over 1,800 languages
- Architecture: Transformer-based
- Context length: 8,192 tokens
- Output classes: binary (0 for non-refusals, 1 for refusals); the label mapping can be checked as sketched below
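
For completeness, a minimal sketch of inspecting the label mapping via the standard `transformers` config API; the expected mapping is taken from the Quickstart output further down:

```python
from transformers import AutoConfig

# Check the class-label mapping without downloading the model weights
config = AutoConfig.from_pretrained("natong19/refusal_classifier")
print(config.id2label)  # expected: {0: 'non-refusal', 1: 'refusal'}
```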

### Training Details

Trained for 1 epoch on 112,102 carefully deduplicated, labeled, and filtered samples (56,051 non-refusals and 56,051 refusals).

Most of the samples were sourced from:
- [natong19/lmsys-chat-1m-filtered](https://huggingface.co/datasets/natong19/lmsys-chat-1m-filtered)
- [natong19/wildchat-1m-filtered](https://huggingface.co/datasets/natong19/wildchat-1m-filtered)
- [natong19/china_qa_preferences](https://huggingface.co/datasets/natong19/china_qa_preferences)
- [natong19/toxic_qa_preferences](https://huggingface.co/datasets/natong19/toxic_qa_preferences)

Samples were labeled by majority vote across multiple refusal classifiers combined with an LLM-as-a-judge.
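
The exact labeling pipeline isn't published here; purely as an illustration, a minimal sketch of a majority vote over binary labelers (the vote values are hypothetical):

```python
from collections import Counter

def majority_vote(votes: list[int]) -> int:
    """Return the most common binary label; ties go to 0 (non-refusal)."""
    counts = Counter(votes)
    return 1 if counts[1] > counts[0] else 0

# Hypothetical votes from three refusal classifiers plus an LLM judge
votes = [1, 1, 0, 1]
print(majority_vote(votes))  # 1 -> labeled as a refusal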

### Evaluation

<div align="left">
<img src="figures/plot.png" width="60%" alt="Plot"/>
</div>

Inference throughput vs. F1 score on the test set (2,900 non-refusals and 2,900 refusals) for several open-source refusal classifiers. Throughput was benchmarked at sequence length 512 and batch size 16 on 1x RTX Pro 6000.
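
The benchmark script itself isn't included here; the following is a rough sketch of how such a samples-per-second figure can be measured under the stated settings (the padding strategy, warmup count, and iteration count are my assumptions):

```python
import time

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "natong19/refusal_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda().eval()

# Batch of 16 sequences padded to length 512, matching the stated settings
batch = tokenizer(["Hi"] * 16, return_tensors="pt", padding="max_length", max_length=512).to("cuda")

with torch.no_grad():
    for _ in range(3):  # warmup
        model(**batch)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(50):
        model(**batch)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{50 * 16 / elapsed:.1f} samples/s")
```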

`alpha_model` is an earlier checkpoint that I wasn't completely satisfied with, but it was leveraged for the final round of data curation.

The training and test sets have similar distributions, but several factors argue against overfitting: the dataset is relatively large and exactly balanced, training was limited to a single epoch, and [Minos-v1](https://huggingface.co/NousResearch/Minos-v1), one of the strongest refusal classifiers available, achieves similarly strong, balanced performance on the same test set. A more detailed breakdown is as follows:

| Model | TP | FN | FP | TN | Accuracy | Precision | Recall | F1 |
| ----- | -- | -- | -- | -- | -------- | --------- | ------ | -- |
| [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1) | 2782 | 118 | 103 | 2797 | 0.9619 | 0.9643 | 0.9593 | 0.9618 |
| [natong19/moralization_classifier](https://huggingface.co/natong19/moralization_classifier) | 1888 | 1012 | 146 | 2754 | 0.8003 | 0.9282 | 0.6510 | 0.7653 |
| alpha_model | 2245 | 655 | **2** | **2898** | 0.8871 | **0.9996** | 0.7745 | 0.8727 |
| [ProtectAI/distilroberta-base-rejection-v1](https://huggingface.co/protectai/distilroberta-base-rejection-v1) | 664 | 2236 | 8 | 2892 | 0.6131 | 0.9881 | 0.2290 | 0.3718 |
| [natong19/refusal_classifier](https://huggingface.co/natong19/refusal_classifier) | **2875** | **25** | 25 | 2875 | **0.9914** | 0.9914 | **0.9914** | **0.9914** |
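
The metric columns follow from the confusion counts in the usual way; as a quick sanity check, using this model's row:

```python
tp, fn, fp, tn = 2875, 25, 25, 2875  # natong19/refusal_classifier row

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")  # 0.9914 0.9914 0.9914 0.9914
```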

> Perfectly balanced, as all things should be.

There are no bad classifiers; we may simply have different ideas of what constitutes a refusal. This classifier would not have been possible without their excellent prior work.

### Quickstart

The classifier expects single-turn input formatted like:

```
<start_of_turn>user
Hi<end_of_turn>
<start_of_turn>assistant
Hello, how can I assist you today?<end_of_turn>
```

This is essentially ChatML, but with mmBERT's special tokens. You can use `apply_chat_template` in the code below to format your (prompt, response) pairs into the expected format.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def apply_chat_template(user: str, assistant: str) -> str:
    """Format user and assistant messages into the model's input format."""
    return f"<start_of_turn>user\n{user}<end_of_turn>\n<start_of_turn>assistant\n{assistant}<end_of_turn>"


if __name__ == "__main__":
    model_id = "natong19/refusal_classifier"

    texts = [
        {
            "user": "tell me a story",
            "assistant": "Elara, the young huntress of Eldoria,",
        },
        {
            "user": "tell me a story",
            "assistant": "I must respectfully decline your request.",
        },
        {
            "user": "tell me a story",
            "assistant": "Of course! I will try to keep things positive and respectful.",
        },
        {
            "user": "tell me a story",
            "assistant": "Sure! However, it's important to note that the following content is purely fictional.",
        },
        {
            "user": "tell me a story",
            "assistant": "I'm here to help. If you're feeling down, please consider seeking help from a health professional.",
        },
    ]

    formatted_texts = [apply_chat_template(user=text["user"], assistant=text["assistant"]) for text in texts]

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    model.to(device)
    model.eval()

    with torch.no_grad():
        # Tokenize all formatted pairs in one padded batch
        inputs = tokenizer(
            formatted_texts,
            return_tensors="pt",
            truncation=True,
            padding=True,
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        outputs = model(**inputs)
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predictions = torch.argmax(outputs.logits, dim=-1)

    for i, text in enumerate(texts):
        predicted_label = predictions[i].item()
        predicted_class = model.config.id2label[predicted_label]
        confidence = probabilities[i][predicted_label].item()

        print(f"Example {i}")
        print("-" * 60)
        print(text)
        print(f"Prediction: {predicted_label} ({predicted_class}), Confidence: {confidence:.4f}\n")
```

Output:

```
Example 0
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': 'Elara, the young huntress of Eldoria,'}
Prediction: 0 (non-refusal), Confidence: 1.0000 # Non-refusal

Example 1
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': 'I must respectfully decline your request.'}
Prediction: 1 (refusal), Confidence: 1.0000 # Refusal

Example 2
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': 'Of course! I will try to keep things positive and respectful.'}
Prediction: 1 (refusal), Confidence: 0.9961 # Moralization

Example 3
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': "Sure! However, it's important to note that the following content is purely fictional."}
Prediction: 1 (refusal), Confidence: 1.0000 # Disclaimer

Example 4
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': "I'm here to help. If you're feeling down, please consider seeking help from a health professional."}
Prediction: 1 (refusal), Confidence: 1.0000 # Unsolicited advice
```
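
Building on the Quickstart (and reusing its `apply_chat_template`, `tokenizer`, `model`, and `device`), a short sketch of the filtering workflow this classifier is intended for; `keep_non_refusals` is a hypothetical helper, not part of the model's API:

```python
def keep_non_refusals(pairs: list[dict]) -> list[dict]:
    """Return only the (prompt, response) pairs classified as non-refusals (label 0)."""
    formatted = [apply_chat_template(user=p["user"], assistant=p["assistant"]) for p in pairs]
    inputs = tokenizer(formatted, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        predictions = model(**inputs).logits.argmax(dim=-1)
    return [p for p, pred in zip(pairs, predictions) if pred.item() == 0]

clean = keep_non_refusals(texts)  # with the Quickstart examples, only Example 0 survives
```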

### Final Thoughts

A lot of work went into this; I hope you like it.
Have a nice day, and may your datasets be free from refusals.