---
license: apache-2.0
---

# Refusal Classifier

<div align="left">
<img src="figures/words.png" width="60%" alt="Words"/>
</div>

*Tired of seeing these? You've come to the right place.*

## Overview

A robust and performant classifier that excels at **detecting refusals, moralizations, disclaimers, unsolicited advice**, and the like.

### Model Details

- Base model: [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base), a multilingual encoder based on [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base)
- Language coverage: over 1,800 languages
- Architecture: Transformer-based
- Context length: 8,192 tokens
- Output classes: binary (0 for non-refusals, 1 for refusals); the label mapping can be checked as sketched below
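
For completeness, a minimal sketch of inspecting the label mapping via the standard `transformers` config API; the expected mapping is taken from the Quickstart output further down:

```python
from transformers import AutoConfig

# Check the class-label mapping without downloading the model weights
config = AutoConfig.from_pretrained("natong19/refusal_classifier")
print(config.id2label)  # expected: {0: 'non-refusal', 1: 'refusal'}
```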

### Training Details

Trained for 1 epoch on 112,102 carefully deduplicated, labeled, and filtered samples (56,051 non-refusals and 56,051 refusals).

Most of the samples were sourced from:
- [natong19/lmsys-chat-1m-filtered](https://huggingface.co/datasets/natong19/lmsys-chat-1m-filtered)
- [natong19/wildchat-1m-filtered](https://huggingface.co/datasets/natong19/wildchat-1m-filtered)
- [natong19/china_qa_preferences](https://huggingface.co/datasets/natong19/china_qa_preferences)
- [natong19/toxic_qa_preferences](https://huggingface.co/datasets/natong19/toxic_qa_preferences)

Samples were labeled by majority vote across multiple refusal classifiers combined with an LLM-as-a-judge.
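
The exact labeling pipeline isn't published here; purely as an illustration, a minimal sketch of a majority vote over binary labelers (the vote values are hypothetical):

```python
from collections import Counter

def majority_vote(votes: list[int]) -> int:
    """Return the most common binary label; ties go to 0 (non-refusal)."""
    counts = Counter(votes)
    return 1 if counts[1] > counts[0] else 0

# Hypothetical votes from three refusal classifiers plus an LLM judge
votes = [1, 1, 0, 1]
print(majority_vote(votes))  # 1 -> labeled as a refusal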

### Evaluation

<div align="left">
<img src="figures/plot.png" width="60%" alt="Plot"/>
</div>

Inference throughput vs. F1 score on the test set (2,900 non-refusals and 2,900 refusals) for several open-source refusal classifiers. Throughput was benchmarked at sequence length 512 and batch size 16 on 1x RTX Pro 6000.
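
The benchmark script itself isn't included here; the following is a rough sketch of how such a samples-per-second figure can be measured under the stated settings (the padding strategy, warmup count, and iteration count are my assumptions):

```python
import time

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "natong19/refusal_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda().eval()

# Batch of 16 sequences padded to length 512, matching the stated settings
batch = tokenizer(["Hi"] * 16, return_tensors="pt", padding="max_length", max_length=512).to("cuda")

with torch.no_grad():
    for _ in range(3):  # warmup
        model(**batch)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(50):
        model(**batch)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{50 * 16 / elapsed:.1f} samples/s")
```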

`alpha_model` is an earlier checkpoint that I wasn't completely satisfied with, but it was leveraged for the final round of data curation.

The training and test sets have similar distributions, but several factors argue against overfitting: the dataset is relatively large and exactly balanced, training was limited to a single epoch, and [Minos-v1](https://huggingface.co/NousResearch/Minos-v1), one of the strongest refusal classifiers available, achieves similarly strong, balanced performance on the same test set. A more detailed breakdown is as follows:

| Model | TP | FN | FP | TN | Accuracy | Precision | Recall | F1 |
| ----- | -- | -- | -- | -- | -------- | --------- | ------ | -- |
| [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1) | 2782 | 118 | 103 | 2797 | 0.9619 | 0.9643 | 0.9593 | 0.9618 |
| [natong19/moralization_classifier](https://huggingface.co/natong19/moralization_classifier) | 1888 | 1012 | 146 | 2754 | 0.8003 | 0.9282 | 0.6510 | 0.7653 |
| alpha_model | 2245 | 655 | **2** | **2898** | 0.8871 | **0.9996** | 0.7745 | 0.8727 |
| [ProtectAI/distilroberta-base-rejection-v1](https://huggingface.co/protectai/distilroberta-base-rejection-v1) | 664 | 2236 | 8 | 2892 | 0.6131 | 0.9881 | 0.2290 | 0.3718 |
| [natong19/refusal_classifier](https://huggingface.co/natong19/refusal_classifier) | **2875** | **25** | 25 | 2875 | **0.9914** | 0.9914 | **0.9914** | **0.9914** |
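
The metric columns follow from the confusion counts in the usual way; as a quick sanity check, using this model's row:

```python
tp, fn, fp, tn = 2875, 25, 25, 2875  # natong19/refusal_classifier row

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")  # 0.9914 0.9914 0.9914 0.9914
```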

> Perfectly balanced, as all things should be.

There are no bad classifiers; we may simply have different ideas of what constitutes a refusal. This classifier would not have been possible without their excellent prior work.

### Quickstart

The classifier expects single-turn input formatted like:

```
<start_of_turn>user
Hi<end_of_turn>
<start_of_turn>assistant
Hello, how can I assist you today?<end_of_turn>
```

This is essentially ChatML, but with mmBERT's special tokens. You can use `apply_chat_template` in the code below to format your (prompt, response) pairs into the expected format.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def apply_chat_template(user: str, assistant: str) -> str:
    """Format user and assistant messages into the model's input format."""
    return f"<start_of_turn>user\n{user}<end_of_turn>\n<start_of_turn>assistant\n{assistant}<end_of_turn>"


if __name__ == "__main__":
    model_id = "natong19/refusal_classifier"

    texts = [
        {
            "user": "tell me a story",
            "assistant": "Elara, the young huntress of Eldoria,",
        },
        {
            "user": "tell me a story",
            "assistant": "I must respectfully decline your request.",
        },
        {
            "user": "tell me a story",
            "assistant": "Of course! I will try to keep things positive and respectful.",
        },
        {
            "user": "tell me a story",
            "assistant": "Sure! However, it's important to note that the following content is purely fictional.",
        },
        {
            "user": "tell me a story",
            "assistant": "I'm here to help. If you're feeling down, please consider seeking help from a health professional.",
        },
    ]

    formatted_texts = [apply_chat_template(user=text["user"], assistant=text["assistant"]) for text in texts]

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    model.to(device)
    model.eval()

    with torch.no_grad():
        # Tokenize all formatted pairs in one padded batch
        inputs = tokenizer(
            formatted_texts,
            return_tensors="pt",
            truncation=True,
            padding=True,
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        outputs = model(**inputs)
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predictions = torch.argmax(outputs.logits, dim=-1)

    for i, text in enumerate(texts):
        predicted_label = predictions[i].item()
        predicted_class = model.config.id2label[predicted_label]
        confidence = probabilities[i][predicted_label].item()

        print(f"Example {i}")
        print("-" * 60)
        print(text)
        print(f"Prediction: {predicted_label} ({predicted_class}), Confidence: {confidence:.4f}\n")
```

Output:

```
Example 0
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': 'Elara, the young huntress of Eldoria,'}
Prediction: 0 (non-refusal), Confidence: 1.0000 # Non-refusal

Example 1
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': 'I must respectfully decline your request.'}
Prediction: 1 (refusal), Confidence: 1.0000 # Refusal

Example 2
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': 'Of course! I will try to keep things positive and respectful.'}
Prediction: 1 (refusal), Confidence: 0.9961 # Moralization

Example 3
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': "Sure! However, it's important to note that the following content is purely fictional."}
Prediction: 1 (refusal), Confidence: 1.0000 # Disclaimer

Example 4
------------------------------------------------------------
{'user': 'tell me a story', 'assistant': "I'm here to help. If you're feeling down, please consider seeking help from a health professional."}
Prediction: 1 (refusal), Confidence: 1.0000 # Unsolicited advice
```
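
Building on the Quickstart (and reusing its `apply_chat_template`, `tokenizer`, `model`, and `device`), a short sketch of the filtering workflow this classifier is intended for; `keep_non_refusals` is a hypothetical helper, not part of the model's API:

```python
def keep_non_refusals(pairs: list[dict]) -> list[dict]:
    """Return only the (prompt, response) pairs classified as non-refusals (label 0)."""
    formatted = [apply_chat_template(user=p["user"], assistant=p["assistant"]) for p in pairs]
    inputs = tokenizer(formatted, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        predictions = model(**inputs).logits.argmax(dim=-1)
    return [p for p, pred in zip(pairs, predictions) if pred.item() == 0]

clean = keep_non_refusals(texts)  # with the Quickstart examples, only Example 0 survives
```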

### Final Thoughts

A lot of work went into this; I hope you like it.
Have a nice day, and may your datasets be free from refusals.