---
license: mit
language:
- en
- tr
- zh
- hi
- de
- fr
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
tags:
- prompt
- safety
- prompt injections
- prompt guard
- guard
- classification
---
# Mezzo Prompt Guard v2 Series

<a href="https://discord.gg/sBMqepFV6m"><img src="https://discord.com/api/guilds/1386414999932506197/embed.png" alt="Discord Link" height="20"></a>

Mezzo Prompt Guard v2 is the second generation of Prompt Guard models, offering significant improvements over the previous generation, including:
- Multilingual capabilities
- Decreased latency
- Increased accuracy and precision
- Lower false positive/false negative rates

# Model Info
## Base Model
- Although our v1 models, like most prompt guard models, were built on DeBERTa v3, I switched to RoBERTa after noticing significant performance increases.
- I landed on xlm-roberta-large and xlm-roberta-base for the Mezzo Prompt Guard v2 Large and Base models, and distilbert-base-multilingual-cased for the Small model; these models offer significant improvements in multilingual performance compared to mDeBERTa.

## Training Data
- More general instruction and conversational data was added to decrease false positive rates compared to v1.
- More examples from multilingual datasets were added to improve the model's multilingual capabilities.

## Training
- Training was done with a maximum sequence length of 256 tokens. Performance may degrade on prompts that exceed this length, so it's recommended to chunk prompts into segments of 256 tokens.
- The Large model was trained on a dataset of 200k examples and was distilled into both the Base and Small models.

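The chunking recommendation above can be sketched as follows. This is a minimal illustration, not part of the release: `classify_chunk` is a hypothetical stand-in for whatever tokenizer/model call you use, and the default `max_len=256` matches the training sequence length.

```python
# Minimal sketch: split a tokenized prompt into 256-token chunks and
# score each chunk, treating the whole prompt as unsafe if any chunk is.
# `classify_chunk` is a hypothetical callable returning an unsafe score.

def chunk_tokens(token_ids, max_len=256):
    """Split a list of token ids into consecutive chunks of at most max_len."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

def max_unsafe_score(token_ids, classify_chunk, max_len=256):
    """Return the highest unsafe score across all chunks of the prompt."""
    return max(classify_chunk(chunk) for chunk in chunk_tokens(token_ids, max_len))
```

Taking the maximum over chunks means a single injected segment anywhere in a long prompt is enough to flag it.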
## Benchmarks
✓ marks the best result in each row.

### Overall
| Metric | Mezzo Prompt Guard v2 Large | Mezzo Prompt Guard v2 Base | Mezzo Prompt Guard v2 Small | Mezzo Prompt Guard Base | Mezzo Prompt Guard Small | Mezzo Prompt Guard Tiny | Llama Prompt Guard 2 (86M) |
| --------- | --------------------------- | -------------------------- | --------------------------- | ----------------------- | ------------------------ | ----------------------- | -------------------------- |
| Precision | 0.8271 ✓ | 0.8211 | 0.8180 | 0.7815 | 0.7905 | 0.7869 | 0.7708 |
| Recall | 0.8403 ✓ | 0.8104 | 0.8147 | 0.7687 | 0.7899 | 0.7978 | 0.6829 |
| F1 Score | 0.8278 ✓ | 0.8147 | 0.8162 | 0.7733 | 0.7902 | 0.7882 | 0.6854 |
| ROC AUC | 0.9192 | 0.9200 ✓ | 0.9087 | 0.8774 | 0.8882 | 0.8619 | 0.8744 |

### F1 Score per Benchmark Dataset
| Dataset | Mezzo Prompt Guard v2 Large | Mezzo Prompt Guard v2 Base | Mezzo Prompt Guard v2 Small | Mezzo Prompt Guard Base | Mezzo Prompt Guard Small | Mezzo Prompt Guard Tiny | Llama Prompt Guard 2 (86M) |
| ------------------------------------------ | --------------------------- | -------------------------- | --------------------------- | ----------------------- | ------------------------ | ----------------------- | -------------------------- |
| beratcmn/turkish-prompt-injections | 0.9369 ✓ | 0.9369 ✓ | 0.8440 | 0.6667 | 0.6567 | 0.7030 | 0.1270 |
| deepset/prompt-injections | 0.8785 ✓ | 0.7755 | 0.6813 | 0.6022 | 0.5412 | 0.5556 | 0.2353 |
| rikka-snow/prompt-injection-multilingual | 0.9135 | 0.9148 ✓ | 0.8789 | 0.7536 | 0.6993 | 0.7003 | 0.1793 |
| rogue-security/prompt-injections-benchmark | 0.7269 | 0.6515 | 0.6888 | 0.6231 | 0.6970 | 0.7287 ✓ | 0.6238 |
| xTRam1/safe-guard-prompt-injection | 0.9899 ✓ | 0.9750 | 0.9482 | 0.9525 | 0.9769 | 0.9542 | 0.6782 |

### Per-Class Metrics
| Metric | Mezzo Prompt Guard v2 Large | Mezzo Prompt Guard v2 Base | Mezzo Prompt Guard v2 Small | Mezzo Prompt Guard Base | Mezzo Prompt Guard Small | Mezzo Prompt Guard Tiny | Llama Prompt Guard 2 (86M) |
| -------------------- | --------------------------- | -------------------------- | --------------------------- | ----------------------- | ------------------------ | ----------------------- | -------------------------- |
| **Safe Precision** | 0.9156 ✓ | 0.8349 | 0.8469 | 0.8000 | 0.8310 | 0.8699 | 0.7101 |
| **Safe Recall** | 0.7896 | 0.8825 | 0.8634 | 0.8595 | 0.8342 | 0.7677 | 0.9428 ✓ |
| **Safe F1 Score** | 0.8480 | 0.8580 ✓ | 0.8551 | 0.8287 | 0.8326 | 0.8156 | 0.8101 |
| **Unsafe Precision** | 0.7386 | 0.8073 | 0.7891 | 0.7630 | 0.7500 | 0.7039 | 0.8314 ✓ |
| **Unsafe Recall** | 0.8909 ✓ | 0.7383 | 0.7660 | 0.6779 | 0.7456 | 0.8279 | 0.4230 |
| **Unsafe F1 Score** | 0.8076 ✓ | 0.7713 | 0.7774 | 0.7179 | 0.7478 | 0.7609 | 0.5607 |

# Limitations
- Mezzo Prompt Guard may occasionally flag safe messages as unsafe. I recommend raising the threshold for the unsafe label to 0.7-0.8 for a lower false positive rate, or lowering it to 0.3-0.4 to catch the most prompt injections.
- More sophisticated attacks outside of its training data may bypass the model. Please report examples of this in the discussions to help me improve these models!
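
The threshold recommendation above can be sketched as follows. This is a minimal illustration under the assumption that the classifier returns an unsafe probability in [0, 1]; the function name and default are hypothetical, not part of the release.

```python
# Minimal sketch: apply a decision threshold to an unsafe-probability score.
# Higher thresholds (0.7-0.8) lower the false positive rate; lower
# thresholds (0.3-0.4) catch more prompt injections at the cost of more
# false positives.

def is_unsafe(unsafe_score: float, threshold: float = 0.5) -> bool:
    """Flag a prompt as unsafe when its score meets or exceeds the threshold."""
    return unsafe_score >= threshold
```

For example, `is_unsafe(0.65, threshold=0.8)` returns `False` (a low-FPR setting lets a borderline prompt through), while `is_unsafe(0.65, threshold=0.4)` returns `True`.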