FinancialSupport committed
Commit f98e7c6 · verified · 1 Parent(s): a06fa29

Create README.md

Files changed (1)
  1. README.md +345 -0
README.md ADDED
@@ -0,0 +1,345 @@
1
+ # Oracolo Model Card
2
+
3
+ ## Model Description
4
+ Oracolo is a DeBERTa-based content moderation model trained to detect harmful content across multiple safety categories. The model analyzes both questions and answers in conversational contexts to determine potential safety risks.
5
+
6
+ ## Intended Use
7
+ - Content moderation for conversational AI systems
8
+ - Safety classification of question-answer pairs
9
+ - Multi-label classification across safety categories
10
+
11
+ ## Training Data
12
+ The model was trained on the BeaverTails dataset, which contains labeled examples of safe and unsafe conversational content.
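
For multi-label training, each labeled example has to become a fixed-order 0/1 vector. The sketch below shows one way to do that. It assumes a record layout with a `category` dictionary of booleans next to `prompt` and `response`; that layout, the placeholder record, and the helper name `to_multi_hot` are illustrative assumptions, not details confirmed by this README.

```python
def to_multi_hot(record):
    """Convert a record's per-category flags to (names, 0/1 vector)."""
    # Sort the names so every record yields vectors in the same order.
    names = sorted(record["category"])
    return names, [int(bool(record["category"][n])) for n in names]


# Hypothetical record; only three categories are shown to keep it short.
record = {
    "prompt": "placeholder question",
    "response": "placeholder answer",
    "category": {
        "privacy_violation": True,
        "animal_abuse": False,
        "self_harm": False,
    },
}

names, vector = to_multi_hot(record)
print(names)
print(vector)
```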
13
+
14
+ ## Model Architecture
15
+ - Base model: Microsoft DeBERTa
16
+ - Multi-label classification head
17
+ - Maximum sequence length: 512 tokens
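
A multi-label head scores each safety category independently, so several categories can fire for the same input. The following is a minimal sketch in plain Python, assuming a sigmoid is applied to each per-category logit (as the usage example in this README does with `torch.sigmoid`). The logit values are made up for illustration.

```python
import math


def sigmoid(x):
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical logits for three categories (values chosen for illustration).
logits = [2.0, -1.0, 0.5]

# Each category is scored on its own; the probabilities need not sum to 1,
# unlike a softmax over mutually exclusive classes.
probabilities = [sigmoid(z) for z in logits]

# A 0.3 threshold turns the scores into 0/1 decisions per category.
decisions = [int(p > 0.3) for p in probabilities]
print(probabilities, decisions)
```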
18
+
19
+ ## Performance
20
+
21
+ ### Overall Safety Classification
22
+ Comparison with PKU-Alignment/beaver-dam-7b on the test set:
23
+
24
+ | Model | Precision | Recall | F1-Score |
25
+ |-------|-----------|--------|----------|
26
+ | Oracolo (DeBERTa) | 0.85 | 0.85 | 0.85 |
27
+ | beaver-dam-7b | 0.77 | 0.88 | 0.87 |
28
+
29
+ ### Usage Example
30
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+
+ def preprocess_text(prompt, response=""):
+     """Format text in the same way as during training."""
+     return f"<prompt> {prompt} </prompt> <response> {response} </response>"
+
+
+ # Load the classifier and tokenizer, then switch to inference mode
+ model = AutoModelForSequenceClassification.from_pretrained("path/to/oracolo")
+ tokenizer = AutoTokenizer.from_pretrained("path/to/oracolo")
+ model.eval()
+
+ prompt = "How do I make a walkway slippery?"
+ response = "I cannot provide advice that could lead to harm."
+ formatted_text = preprocess_text(prompt, response)
+ inputs = tokenizer(formatted_text, return_tensors="pt", truncation=True, max_length=512)
+
+ with torch.no_grad():
+     # Unpack the tokenizer output into keyword arguments (input_ids, attention_mask, ...)
+     outputs = model(**inputs)
+     predictions = torch.sigmoid(outputs.logits).cpu().numpy()[0]
+
+ # Apply threshold (0.3 recommended based on validation)
+ class_predictions = (predictions > 0.3).astype(int)
+ ```
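
The thresholded vector from the usage example is easier to act on once each position has a name. A hedged sketch follows. The category names are taken from the classification report in this README, but the order of the model's output labels is not documented here, so the alphabetical order below is an assumption (in a real setup, check `model.config.id2label`). The helper `decode_predictions`, the probability values, and the rule "any flagged category means unsafe" are hypothetical.

```python
# Category names from the classification report; alphabetical order is assumed.
CATEGORIES = [
    "animal_abuse",
    "child_abuse",
    "controversial_topics,politics",
    "discrimination,stereotype,injustice",
    "drug_abuse,weapons,banned_substance",
    "financial_crime,property_crime,theft",
    "hate_speech,offensive_language",
    "misinformation_regarding_ethics,laws_and_safety",
    "non_violent_unethical_behavior",
    "privacy_violation",
    "self_harm",
    "sexually_explicit,adult_content",
    "terrorism,organized_crime",
    "violence,aiding_and_abetting,incitement",
]


def decode_predictions(probabilities, threshold=0.3):
    """Return (is_unsafe, flagged) where flagged maps category -> probability."""
    if len(probabilities) != len(CATEGORIES):
        raise ValueError("expected one probability per category")
    flagged = {
        name: float(p)
        for name, p in zip(CATEGORIES, probabilities)
        if p > threshold
    }
    # Treating "any category flagged" as unsafe is an assumption of this sketch.
    return bool(flagged), flagged


# Hypothetical sigmoid outputs standing in for the model's `predictions`.
example = [0.02] * len(CATEGORIES)
example[CATEGORIES.index("privacy_violation")] = 0.81
is_unsafe, flagged = decode_predictions(example)
print(is_unsafe, flagged)
```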
52
+ ## Full Classification Report
53
+
54
+
55
+ === Per-Category Classification Report for Both Models ===
56
+
57
+ Category: animal_abuse
+
+ BERT
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not animal_abuse | 1.00 | 0.99 | 0.99 | 99 |
+ | animal_abuse | 0.50 | 1.00 | 0.67 | 1 |
+ | accuracy | | | 0.99 | 100 |
+ | macro avg | 0.75 | 0.99 | 0.83 | 100 |
+ | weighted avg | 0.99 | 0.99 | 0.99 | 100 |
+
+ QA
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not animal_abuse | 1.00 | 0.99 | 0.99 | 99 |
+ | animal_abuse | 0.50 | 1.00 | 0.67 | 1 |
+ | accuracy | | | 0.99 | 100 |
+ | macro avg | 0.75 | 0.99 | 0.83 | 100 |
+ | weighted avg | 0.99 | 0.99 | 0.99 | 100 |
77
+
78
+
79
+ Category: child_abuse
+
+ BERT
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not child_abuse | 0.99 | 0.99 | 0.99 | 99 |
+ | child_abuse | 0.00 | 0.00 | 0.00 | 1 |
+ | accuracy | | | 0.98 | 100 |
+ | macro avg | 0.49 | 0.49 | 0.49 | 100 |
+ | weighted avg | 0.98 | 0.98 | 0.98 | 100 |
+
+ QA
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not child_abuse | 0.99 | 0.99 | 0.99 | 99 |
+ | child_abuse | 0.00 | 0.00 | 0.00 | 1 |
+ | accuracy | | | 0.98 | 100 |
+ | macro avg | 0.49 | 0.49 | 0.49 | 100 |
+ | weighted avg | 0.98 | 0.98 | 0.98 | 100 |
99
+
100
+
101
+ Category: controversial_topics,politics
+
+ BERT
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not controversial_topics,politics | 0.99 | 1.00 | 0.99 | 97 |
+ | controversial_topics,politics | 1.00 | 0.67 | 0.80 | 3 |
+ | accuracy | | | 0.99 | 100 |
+ | macro avg | 0.99 | 0.83 | 0.90 | 100 |
+ | weighted avg | 0.99 | 0.99 | 0.99 | 100 |
+
+ QA
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not controversial_topics,politics | 0.99 | 1.00 | 0.99 | 97 |
+ | controversial_topics,politics | 1.00 | 0.67 | 0.80 | 3 |
+ | accuracy | | | 0.99 | 100 |
+ | macro avg | 0.99 | 0.83 | 0.90 | 100 |
+ | weighted avg | 0.99 | 0.99 | 0.99 | 100 |
121
+
122
+
123
+ Category: discrimination,stereotype,injustice
+
+ BERT
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not discrimination,stereotype,injustice | 0.98 | 0.95 | 0.96 | 94 |
+ | discrimination,stereotype,injustice | 0.44 | 0.67 | 0.53 | 6 |
+ | accuracy | | | 0.93 | 100 |
+ | macro avg | 0.71 | 0.81 | 0.75 | 100 |
+ | weighted avg | 0.95 | 0.93 | 0.94 | 100 |
+
+ QA
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not discrimination,stereotype,injustice | 0.99 | 0.98 | 0.98 | 94 |
+ | discrimination,stereotype,injustice | 0.71 | 0.83 | 0.77 | 6 |
+ | accuracy | | | 0.97 | 100 |
+ | macro avg | 0.85 | 0.91 | 0.88 | 100 |
+ | weighted avg | 0.97 | 0.97 | 0.97 | 100 |
143
+
144
+
145
+ Category: drug_abuse,weapons,banned_substance
+
+ BERT
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not drug_abuse,weapons,banned_substance | 1.00 | 0.96 | 0.98 | 96 |
+ | drug_abuse,weapons,banned_substance | 0.50 | 1.00 | 0.67 | 4 |
+ | accuracy | | | 0.96 | 100 |
+ | macro avg | 0.75 | 0.98 | 0.82 | 100 |
+ | weighted avg | 0.98 | 0.96 | 0.97 | 100 |
+
+ QA
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not drug_abuse,weapons,banned_substance | 0.98 | 0.99 | 0.98 | 96 |
+ | drug_abuse,weapons,banned_substance | 0.67 | 0.50 | 0.57 | 4 |
+ | accuracy | | | 0.97 | 100 |
+ | macro avg | 0.82 | 0.74 | 0.78 | 100 |
+ | weighted avg | 0.97 | 0.97 | 0.97 | 100 |
165
+
166
+
167
+ Category: financial_crime,property_crime,theft
+
+ BERT
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not financial_crime,property_crime,theft | 0.98 | 0.98 | 0.98 | 95 |
+ | financial_crime,property_crime,theft | 0.60 | 0.60 | 0.60 | 5 |
+ | accuracy | | | 0.96 | 100 |
+ | macro avg | 0.79 | 0.79 | 0.79 | 100 |
+ | weighted avg | 0.96 | 0.96 | 0.96 | 100 |
+
+ QA
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not financial_crime,property_crime,theft | 0.99 | 0.99 | 0.99 | 95 |
+ | financial_crime,property_crime,theft | 0.80 | 0.80 | 0.80 | 5 |
+ | accuracy | | | 0.98 | 100 |
+ | macro avg | 0.89 | 0.89 | 0.89 | 100 |
+ | weighted avg | 0.98 | 0.98 | 0.98 | 100 |
187
+
188
+
189
+ Category: hate_speech,offensive_language
+
+ BERT
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not hate_speech,offensive_language | 0.95 | 0.98 | 0.96 | 93 |
+ | hate_speech,offensive_language | 0.50 | 0.29 | 0.36 | 7 |
+ | accuracy | | | 0.93 | 100 |
+ | macro avg | 0.72 | 0.63 | 0.66 | 100 |
+ | weighted avg | 0.92 | 0.93 | 0.92 | 100 |
+
+ QA
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not hate_speech,offensive_language | 0.96 | 1.00 | 0.98 | 93 |
+ | hate_speech,offensive_language | 1.00 | 0.43 | 0.60 | 7 |
+ | accuracy | | | 0.96 | 100 |
+ | macro avg | 0.98 | 0.71 | 0.79 | 100 |
+ | weighted avg | 0.96 | 0.96 | 0.95 | 100 |
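
Each report row follows from four confusion-matrix counts. As a worked check on the QA rows for this category, the sketch below assumes TP=3, FN=4, FP=0, TN=93. These counts are not stated in the source; they are inferred from the reported support (7 positive, 93 negative), recall (0.43) and precision (1.00). The function name `prf` is illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts inferred from the reported numbers (an assumption, not source data).
tp, fn, fp, tn = 3, 4, 0, 93

pos = prf(tp, fp, fn)  # the hate_speech,offensive_language row
neg = prf(tn, fn, fp)  # the "Not ..." row: the two classes swap roles
support_pos, support_neg = tp + fn, tn + fp
total = support_pos + support_neg

accuracy = (tp + tn) / total
macro_f1 = (pos[2] + neg[2]) / 2
weighted_f1 = (pos[2] * support_pos + neg[2] * support_neg) / total

for label, row in (("negative", neg), ("positive", pos)):
    print(label, [round(v, 2) for v in row])
print(round(accuracy, 2), round(macro_f1, 2), round(weighted_f1, 2))
```

With these counts the script reproduces the reported QA values: 0.96 / 1.00 / 0.98 for the negative class, 1.00 / 0.43 / 0.60 for the positive class, accuracy 0.96, macro F1 0.79 and weighted F1 0.95.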
209
+
210
+
211
+ Category: misinformation_regarding_ethics,laws_and_safety
+
+ BERT
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not misinformation_regarding_ethics,laws_and_safety | 0.98 | 1.00 | 0.99 | 98 |
+ | misinformation_regarding_ethics,laws_and_safety | 0.00 | 0.00 | 0.00 | 2 |
+ | accuracy | | | 0.98 | 100 |
+ | macro avg | 0.49 | 0.50 | 0.49 | 100 |
+ | weighted avg | 0.96 | 0.98 | 0.97 | 100 |
+
+ QA
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not misinformation_regarding_ethics,laws_and_safety | 0.98 | 1.00 | 0.99 | 98 |
+ | misinformation_regarding_ethics,laws_and_safety | 0.00 | 0.00 | 0.00 | 2 |
+ | accuracy | | | 0.98 | 100 |
+ | macro avg | 0.49 | 0.50 | 0.49 | 100 |
+ | weighted avg | 0.96 | 0.98 | 0.97 | 100 |
231
+
232
+
233
+ Category: non_violent_unethical_behavior
+
+ BERT
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not non_violent_unethical_behavior | 0.87 | 0.87 | 0.87 | 77 |
+ | non_violent_unethical_behavior | 0.57 | 0.57 | 0.57 | 23 |
+ | accuracy | | | 0.80 | 100 |
+ | macro avg | 0.72 | 0.72 | 0.72 | 100 |
+ | weighted avg | 0.80 | 0.80 | 0.80 | 100 |
+
+ QA
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not non_violent_unethical_behavior | 0.90 | 0.95 | 0.92 | 77 |
+ | non_violent_unethical_behavior | 0.79 | 0.65 | 0.71 | 23 |
+ | accuracy | | | 0.88 | 100 |
+ | macro avg | 0.85 | 0.80 | 0.82 | 100 |
+ | weighted avg | 0.88 | 0.88 | 0.88 | 100 |
253
+
254
+
255
+ Category: privacy_violation
+
+ BERT
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not privacy_violation | 1.00 | 1.00 | 1.00 | 97 |
+ | privacy_violation | 1.00 | 1.00 | 1.00 | 3 |
+ | accuracy | | | 1.00 | 100 |
+ | macro avg | 1.00 | 1.00 | 1.00 | 100 |
+ | weighted avg | 1.00 | 1.00 | 1.00 | 100 |
+
+ QA
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not privacy_violation | 1.00 | 1.00 | 1.00 | 97 |
+ | privacy_violation | 1.00 | 1.00 | 1.00 | 3 |
+ | accuracy | | | 1.00 | 100 |
+ | macro avg | 1.00 | 1.00 | 1.00 | 100 |
+ | weighted avg | 1.00 | 1.00 | 1.00 | 100 |
275
+
276
+
277
+ Category: self_harm
278
+ Only class 0 (the negative class) is present for this category among the evaluated examples, so no report is available.
279
+
280
+ Category: sexually_explicit,adult_content
+
+ BERT
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not sexually_explicit,adult_content | 0.99 | 1.00 | 0.99 | 95 |
+ | sexually_explicit,adult_content | 1.00 | 0.80 | 0.89 | 5 |
+ | accuracy | | | 0.99 | 100 |
+ | macro avg | 0.99 | 0.90 | 0.94 | 100 |
+ | weighted avg | 0.99 | 0.99 | 0.99 | 100 |
+
+ QA
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not sexually_explicit,adult_content | 0.99 | 1.00 | 0.99 | 95 |
+ | sexually_explicit,adult_content | 1.00 | 0.80 | 0.89 | 5 |
+ | accuracy | | | 0.99 | 100 |
+ | macro avg | 0.99 | 0.90 | 0.94 | 100 |
+ | weighted avg | 0.99 | 0.99 | 0.99 | 100 |
300
+
301
+
302
+ Category: terrorism,organized_crime
+
+ BERT
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not terrorism,organized_crime | 0.98 | 0.99 | 0.98 | 98 |
+ | terrorism,organized_crime | 0.00 | 0.00 | 0.00 | 2 |
+ | accuracy | | | 0.97 | 100 |
+ | macro avg | 0.49 | 0.49 | 0.49 | 100 |
+ | weighted avg | 0.96 | 0.97 | 0.97 | 100 |
+
+ QA
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not terrorism,organized_crime | 0.98 | 0.99 | 0.98 | 98 |
+ | terrorism,organized_crime | 0.00 | 0.00 | 0.00 | 2 |
+ | accuracy | | | 0.97 | 100 |
+ | macro avg | 0.49 | 0.49 | 0.49 | 100 |
+ | weighted avg | 0.96 | 0.97 | 0.97 | 100 |
322
+
323
+
324
+ Category: violence,aiding_and_abetting,incitement
+
+ BERT
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not violence,aiding_and_abetting,incitement | 0.92 | 0.93 | 0.92 | 72 |
+ | violence,aiding_and_abetting,incitement | 0.81 | 0.79 | 0.80 | 28 |
+ | accuracy | | | 0.89 | 100 |
+ | macro avg | 0.87 | 0.86 | 0.86 | 100 |
+ | weighted avg | 0.89 | 0.89 | 0.89 | 100 |
+
+ QA
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Not violence,aiding_and_abetting,incitement | 0.91 | 0.99 | 0.95 | 72 |
+ | violence,aiding_and_abetting,incitement | 0.95 | 0.75 | 0.84 | 28 |
+ | accuracy | | | 0.92 | 100 |
+ | macro avg | 0.93 | 0.87 | 0.89 | 100 |
+ | weighted avg | 0.92 | 0.92 | 0.92 | 100 |
344
+
345
+