+# Oracolo Model Card
+## Model Description
+Oracolo is a DeBERTa-based content moderation model trained to detect harmful content across multiple safety categories. The model analyzes both questions and answers in conversational contexts to determine potential safety risks.
+## Intended Use
+- Content moderation for conversational AI systems
+- Safety classification of question-answer pairs
+- Multi-label classification across safety categories
+## Training Data
+The model was trained on the BeaverTails dataset, which contains labeled examples of safe and unsafe conversational content.
+## Model Architecture
+- Base model: Microsoft DeBERTa
+- Multi-label classification head
+- Maximum sequence length: 512 tokens
+## Performance
+### Overall Safety Classification
+Comparison with PKU-Alignment/beaver-dam-7b on the test set:
+| Model | Precision | Recall | F1-Score |
+|-------|-----------|--------|----------|
+| Oracolo (DeBERTa) | 0.85 | 0.85 | 0.85 |
+| beaver-dam-7b | 0.77 | 0.88 | 0.87 |
+## Usage Example
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+
+
+def preprocess_text(prompt, response=""):
+    """Format text in the same way as during training."""
+    return f"<prompt> {prompt} </prompt> <response> {response} </response>"
+
+
+model = AutoModelForSequenceClassification.from_pretrained("path/to/oracolo")
+tokenizer = AutoTokenizer.from_pretrained("path/to/oracolo")
+model.eval()  # disable dropout for inference
+
+prompt = "How do I make a walkway slippery?"
+response = "I cannot provide advice that could lead to harm."
+formatted_text = preprocess_text(prompt, response)
+inputs = tokenizer(formatted_text, return_tensors="pt", truncation=True, max_length=512)
+
+with torch.no_grad():
+    # Unpack the tokenizer output so input_ids and attention_mask
+    # are passed as keyword arguments.
+    outputs = model(**inputs)
+
+# Multi-label head: apply an independent sigmoid to each category logit.
+predictions = torch.sigmoid(outputs.logits).cpu().numpy()[0]
+# Apply threshold (0.3 recommended based on validation)
+class_predictions = (predictions > 0.3).astype(int)
+```
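The per-category probabilities from the usage example can be turned into readable category names. The following is a minimal sketch, not part of the released code: `CATEGORIES` and `decode_predictions` are hypothetical names, the label order is assumed to be the alphabetical order used in the classification report, and treating "no category flagged" as safe is also an assumption. If the checkpoint defines `config.id2label`, that mapping should take precedence.

```python
# Assumed label order: alphabetical, as listed in the per-category report.
# Verify against model.config.id2label before relying on it.
CATEGORIES = [
    "animal_abuse",
    "child_abuse",
    "controversial_topics,politics",
    "discrimination,stereotype,injustice",
    "drug_abuse,weapons,banned_substance",
    "financial_crime,property_crime,theft",
    "hate_speech,offensive_language",
    "misinformation_regarding_ethics,laws_and_safety",
    "non_violent_unethical_behavior",
    "privacy_violation",
    "self_harm",
    "sexually_explicit,adult_content",
    "terrorism,organized_crime",
    "violence,aiding_and_abetting,incitement",
]


def decode_predictions(probabilities, threshold=0.3):
    """Map per-category probabilities to the names of flagged categories."""
    if len(probabilities) != len(CATEGORIES):
        raise ValueError(
            f"expected {len(CATEGORIES)} probabilities, got {len(probabilities)}"
        )
    flagged = [
        name for name, p in zip(CATEGORIES, probabilities) if p > threshold
    ]
    # Assumption: a pair is considered safe when no category is flagged.
    return {"is_safe": not flagged, "categories": flagged}


# Made-up probabilities for illustration only (not model output).
example = [0.05] * len(CATEGORIES)
example[8] = 0.62   # non_violent_unethical_behavior
example[13] = 0.31  # violence,aiding_and_abetting,incitement
result = decode_predictions(example)
print(result)
```

With the real model, `decode_predictions(predictions)` can be called directly on the `predictions` array, since the function only needs a sequence of floats.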
+## Full Classification Report
+=== Per-Category Classification Report for Both Models ===
+Category: animal_abuse
+BERT
+                  precision    recall  f1-score   support
+Not animal_abuse       1.00      0.99      0.99        99
+    animal_abuse       0.50      1.00      0.67         1
+        accuracy                           0.99       100
+       macro avg       0.75      0.99      0.83       100
+    weighted avg       0.99      0.99      0.99       100
+QA
+                  precision    recall  f1-score   support
+Not animal_abuse       1.00      0.99      0.99        99
+    animal_abuse       0.50      1.00      0.67         1
+        accuracy                           0.99       100
+       macro avg       0.75      0.99      0.83       100
+    weighted avg       0.99      0.99      0.99       100
+Category: child_abuse
+BERT
+                 precision    recall  f1-score   support
+Not child_abuse       0.99      0.99      0.99        99
+    child_abuse       0.00      0.00      0.00         1
+       accuracy                           0.98       100
+      macro avg       0.49      0.49      0.49       100
+   weighted avg       0.98      0.98      0.98       100
+QA
+                 precision    recall  f1-score   support
+Not child_abuse       0.99      0.99      0.99        99
+    child_abuse       0.00      0.00      0.00         1
+       accuracy                           0.98       100
+      macro avg       0.49      0.49      0.49       100
+   weighted avg       0.98      0.98      0.98       100
+Category: controversial_topics,politics
+BERT
+                                   precision    recall  f1-score   support
+Not controversial_topics,politics       0.99      1.00      0.99        97
+    controversial_topics,politics       1.00      0.67      0.80         3
+                         accuracy                           0.99       100
+                        macro avg       0.99      0.83      0.90       100
+                     weighted avg       0.99      0.99      0.99       100
+QA
+                                   precision    recall  f1-score   support
+Not controversial_topics,politics       0.99      1.00      0.99        97
+    controversial_topics,politics       1.00      0.67      0.80         3
+                         accuracy                           0.99       100
+                        macro avg       0.99      0.83      0.90       100
+                     weighted avg       0.99      0.99      0.99       100
+Category: discrimination,stereotype,injustice
+BERT
+                                         precision    recall  f1-score   support
+Not discrimination,stereotype,injustice       0.98      0.95      0.96        94
+    discrimination,stereotype,injustice       0.44      0.67      0.53         6
+                               accuracy                           0.93       100
+                              macro avg       0.71      0.81      0.75       100
+                           weighted avg       0.95      0.93      0.94       100
+QA
+                                         precision    recall  f1-score   support
+Not discrimination,stereotype,injustice       0.99      0.98      0.98        94
+    discrimination,stereotype,injustice       0.71      0.83      0.77         6
+                               accuracy                           0.97       100
+                              macro avg       0.85      0.91      0.88       100
+                           weighted avg       0.97      0.97      0.97       100
+Category: drug_abuse,weapons,banned_substance
+BERT
+                                         precision    recall  f1-score   support
+Not drug_abuse,weapons,banned_substance       1.00      0.96      0.98        96
+    drug_abuse,weapons,banned_substance       0.50      1.00      0.67         4
+                               accuracy                           0.96       100
+                              macro avg       0.75      0.98      0.82       100
+                           weighted avg       0.98      0.96      0.97       100
+QA
+                                         precision    recall  f1-score   support
+Not drug_abuse,weapons,banned_substance       0.98      0.99      0.98        96
+    drug_abuse,weapons,banned_substance       0.67      0.50      0.57         4
+                               accuracy                           0.97       100
+                              macro avg       0.82      0.74      0.78       100
+                           weighted avg       0.97      0.97      0.97       100
+Category: financial_crime,property_crime,theft
+BERT
+                                          precision    recall  f1-score   support
+Not financial_crime,property_crime,theft       0.98      0.98      0.98        95
+    financial_crime,property_crime,theft       0.60      0.60      0.60         5
+                                accuracy                           0.96       100
+                               macro avg       0.79      0.79      0.79       100
+                            weighted avg       0.96      0.96      0.96       100
+QA
+                                          precision    recall  f1-score   support
+Not financial_crime,property_crime,theft       0.99      0.99      0.99        95
+    financial_crime,property_crime,theft       0.80      0.80      0.80         5
+                                accuracy                           0.98       100
+                               macro avg       0.89      0.89      0.89       100
+                            weighted avg       0.98      0.98      0.98       100
+Category: hate_speech,offensive_language
+BERT
+                                    precision    recall  f1-score   support
+Not hate_speech,offensive_language       0.95      0.98      0.96        93
+    hate_speech,offensive_language       0.50      0.29      0.36         7
+                          accuracy                           0.93       100
+                         macro avg       0.72      0.63      0.66       100
+                      weighted avg       0.92      0.93      0.92       100
+QA
+                                    precision    recall  f1-score   support
+Not hate_speech,offensive_language       0.96      1.00      0.98        93
+    hate_speech,offensive_language       1.00      0.43      0.60         7
+                          accuracy                           0.96       100
+                         macro avg       0.98      0.71      0.79       100
+                      weighted avg       0.96      0.96      0.95       100
+Category: misinformation_regarding_ethics,laws_and_safety
+BERT
+                                                     precision    recall  f1-score   support
+Not misinformation_regarding_ethics,laws_and_safety       0.98      1.00      0.99        98
+    misinformation_regarding_ethics,laws_and_safety       0.00      0.00      0.00         2
+                                           accuracy                           0.98       100
+                                          macro avg       0.49      0.50      0.49       100
+                                       weighted avg       0.96      0.98      0.97       100
+QA
+                                                     precision    recall  f1-score   support
+Not misinformation_regarding_ethics,laws_and_safety       0.98      1.00      0.99        98
+    misinformation_regarding_ethics,laws_and_safety       0.00      0.00      0.00         2
+                                           accuracy                           0.98       100
+                                          macro avg       0.49      0.50      0.49       100
+                                       weighted avg       0.96      0.98      0.97       100
+Category: non_violent_unethical_behavior
+BERT
+                                    precision    recall  f1-score   support
+Not non_violent_unethical_behavior       0.87      0.87      0.87        77
+    non_violent_unethical_behavior       0.57      0.57      0.57        23
+                          accuracy                           0.80       100
+                         macro avg       0.72      0.72      0.72       100
+                      weighted avg       0.80      0.80      0.80       100
+QA
+                                    precision    recall  f1-score   support
+Not non_violent_unethical_behavior       0.90      0.95      0.92        77
+    non_violent_unethical_behavior       0.79      0.65      0.71        23
+                          accuracy                           0.88       100
+                         macro avg       0.85      0.80      0.82       100
+                      weighted avg       0.88      0.88      0.88       100
+Category: privacy_violation
+BERT
+                       precision    recall  f1-score   support
+Not privacy_violation       1.00      1.00      1.00        97
+    privacy_violation       1.00      1.00      1.00         3
+             accuracy                           1.00       100
+            macro avg       1.00      1.00      1.00       100
+         weighted avg       1.00      1.00      1.00       100
+QA
+                       precision    recall  f1-score   support
+Not privacy_violation       1.00      1.00      1.00        97
+    privacy_violation       1.00      1.00      1.00         3
+             accuracy                           1.00       100
+            macro avg       1.00      1.00      1.00       100
+         weighted avg       1.00      1.00      1.00       100
+Category: self_harm
+Only class 0 (the negative class) is present for this category, so no report is shown.
+Category: sexually_explicit,adult_content
+BERT
+                                     precision    recall  f1-score   support
+Not sexually_explicit,adult_content       0.99      1.00      0.99        95
+    sexually_explicit,adult_content       1.00      0.80      0.89         5
+                           accuracy                           0.99       100
+                          macro avg       0.99      0.90      0.94       100
+                       weighted avg       0.99      0.99      0.99       100
+QA
+                                     precision    recall  f1-score   support
+Not sexually_explicit,adult_content       0.99      1.00      0.99        95
+    sexually_explicit,adult_content       1.00      0.80      0.89         5
+                           accuracy                           0.99       100
+                          macro avg       0.99      0.90      0.94       100
+                       weighted avg       0.99      0.99      0.99       100
+Category: terrorism,organized_crime
+BERT
+                               precision    recall  f1-score   support
+Not terrorism,organized_crime       0.98      0.99      0.98        98
+    terrorism,organized_crime       0.00      0.00      0.00         2
+                     accuracy                           0.97       100
+                    macro avg       0.49      0.49      0.49       100
+                 weighted avg       0.96      0.97      0.97       100
+QA
+                               precision    recall  f1-score   support
+Not terrorism,organized_crime       0.98      0.99      0.98        98
+    terrorism,organized_crime       0.00      0.00      0.00         2
+                     accuracy                           0.97       100
+                    macro avg       0.49      0.49      0.49       100
+                 weighted avg       0.96      0.97      0.97       100
+Category: violence,aiding_and_abetting,incitement
+BERT
+                                             precision    recall  f1-score   support
+Not violence,aiding_and_abetting,incitement       0.92      0.93      0.92        72
+    violence,aiding_and_abetting,incitement       0.81      0.79      0.80        28
+                                   accuracy                           0.89       100
+                                  macro avg       0.87      0.86      0.86       100
+                               weighted avg       0.89      0.89      0.89       100
+QA
+                                             precision    recall  f1-score   support
+Not violence,aiding_and_abetting,incitement       0.91      0.99      0.95        72
+    violence,aiding_and_abetting,incitement       0.95      0.75      0.84        28
+                                   accuracy                           0.92       100
+                                  macro avg       0.93      0.87      0.89       100
+                               weighted avg       0.92      0.92      0.92       100
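Each positive-class row in the reports above follows from three confusion counts. As a worked check, the sketch below recomputes the `animal_abuse` row (precision 0.50, recall 1.00, F1 0.67), assuming the counts that row implies: 1 true positive, 1 false positive, 0 false negatives. `prf` is a hypothetical helper, and returning 0.0 for an undefined ratio is a convention chosen here to match the 0.00 rows in the reports.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) for one class from confusion counts."""
    # Undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Counts implied by the animal_abuse row: support 1, recall 1.00, precision 0.50.
precision, recall, f1 = prf(tp=1, fp=1, fn=0)
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 0.50 1.00 0.67
```

Because most categories have fewer than ten positive examples in this 100-sample report, a single misclassification moves these metrics substantially, so per-category scores with small support should be read with caution.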