---
license: mit
language:
- en
- tr
- zh
- hi
- de
- fr
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
tags:
- prompt
- safety
- prompt injections
- prompt guard
- guard
- classification
---

# Mezzo Prompt Guard v2 Series
<a href="https://discord.gg/sBMqepFV6m"><img src="https://discord.com/api/guilds/1386414999932506197/embed.png" alt="Discord Link" height="20"></a>

Mezzo Prompt Guard v2 is the second generation of Prompt Guard models, offering significant improvements over the previous generation, such as:

- Multilingual capabilities
- Decreased latency
- Increased accuracy and precision
- Lower false positive and false negative rates

# Model Info

## Base Model

- While our v1 models, like most prompt guard models, were built on DeBERTa v3, I switched to RoBERTa after noticing significant performance increases.
- I landed on xlm-roberta-large and xlm-roberta-base for the Mezzo Prompt Guard v2 Large and Base models, and distilbert-base-multilingual-cased for the Small model; these models offer significant improvements in multilingual performance compared to mDeBERTa.

## Training Data

- More general instruction and conversational data was added to decrease the false positive rate compared to v1.
- More examples from multilingual datasets were added to improve the model's multilingual capabilities.

## Training

- Training was done with a max sequence length of 256; performance may degrade on prompts that exceed this, so it's recommended to chunk prompts into segments of 256 tokens.
- The Large model was trained on a dataset of 200k examples and was distilled into both the Base and Small models.

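A minimal sketch of the recommended 256-token chunking, assuming you already have the prompt's token ids from the model's tokenizer (the Hugging Face calls in the comment are illustrative, and the checkpoint name is a placeholder):

```python
def chunk_token_ids(token_ids: list[int], max_tokens: int = 256) -> list[list[int]]:
    """Split a token-id sequence into consecutive chunks of at most max_tokens ids."""
    return [token_ids[i : i + max_tokens] for i in range(0, len(token_ids), max_tokens)]

# Illustrative usage with a Hugging Face tokenizer:
#   ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
#   chunks = [tokenizer.decode(c) for c in chunk_token_ids(ids)]
#   # classify each chunk separately and flag the prompt if any chunk is unsafe
```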
## Benchmarks

β marks the best score in each row.

### Overall

| Metric    | Mezzo Prompt Guard v2 Large | Mezzo Prompt Guard v2 Base | Mezzo Prompt Guard v2 Small | Mezzo Prompt Guard Base | Mezzo Prompt Guard Small | Mezzo Prompt Guard Tiny | Llama Prompt Guard 2 (86M) |
| --------- | --------------------------- | -------------------------- | --------------------------- | ----------------------- | ------------------------ | ----------------------- | -------------------------- |
| Precision | 0.8271 β                    | 0.8211                     | 0.8180                      | 0.7815                  | 0.7905                   | 0.7869                  | 0.7708                     |
| Recall    | 0.8403 β                    | 0.8104                     | 0.8147                      | 0.7687                  | 0.7899                   | 0.7978                  | 0.6829                     |
| F1 Score  | 0.8278 β                    | 0.8147                     | 0.8162                      | 0.7733                  | 0.7902                   | 0.7882                  | 0.6854                     |
| ROC AUC   | 0.9192                      | 0.9200 β                   | 0.9087                      | 0.8774                  | 0.8882                   | 0.8619                  | 0.8744                     |

### F1 Score per Benchmark Dataset

| Dataset                                    | Mezzo Prompt Guard v2 Large | Mezzo Prompt Guard v2 Base | Mezzo Prompt Guard v2 Small | Mezzo Prompt Guard Base | Mezzo Prompt Guard Small | Mezzo Prompt Guard Tiny | Llama Prompt Guard 2 (86M) |
| ------------------------------------------ | --------------------------- | -------------------------- | --------------------------- | ----------------------- | ------------------------ | ----------------------- | -------------------------- |
| beratcmn/turkish-prompt-injections         | 0.9369 β                    | 0.9369 β                   | 0.8440                      | 0.6667                  | 0.6567                   | 0.7030                  | 0.1270                     |
| deepset/prompt-injections                  | 0.8785 β                    | 0.7755                     | 0.6813                      | 0.6022                  | 0.5412                   | 0.5556                  | 0.2353                     |
| rikka-snow/prompt-injection-multilingual   | 0.9135                      | 0.9148 β                   | 0.8789                      | 0.7536                  | 0.6993                   | 0.7003                  | 0.1793                     |
| rogue-security/prompt-injections-benchmark | 0.7269                      | 0.6515                     | 0.6888                      | 0.6231                  | 0.6970                   | 0.7287 β                | 0.6238                     |
| xTRam1/safe-guard-prompt-injection         | 0.9899 β                    | 0.9750                     | 0.9482                      | 0.9525                  | 0.9769                   | 0.9542                  | 0.6782                     |

### Specific Benchmarks

| Metric               | Mezzo Prompt Guard v2 Large | Mezzo Prompt Guard v2 Base | Mezzo Prompt Guard v2 Small | Mezzo Prompt Guard Base | Mezzo Prompt Guard Small | Mezzo Prompt Guard Tiny | Llama Prompt Guard 2 (86M) |
| -------------------- | --------------------------- | -------------------------- | --------------------------- | ----------------------- | ------------------------ | ----------------------- | -------------------------- |
| **Safe Precision**   | 0.9156 β                    | 0.8349                     | 0.8469                      | 0.8000                  | 0.8310                   | 0.8699                  | 0.7101                     |
| **Safe Recall**      | 0.7896                      | 0.8825                     | 0.8634                      | 0.8595                  | 0.8342                   | 0.7677                  | 0.9428 β                   |
| **Safe F1 Score**    | 0.8480                      | 0.8580 β                   | 0.8551                      | 0.8287                  | 0.8326                   | 0.8156                  | 0.8101                     |
| **Unsafe Precision** | 0.7386                      | 0.8073                     | 0.7891                      | 0.7630                  | 0.7500                   | 0.7039                  | 0.8314 β                   |
| **Unsafe Recall**    | 0.8909 β                    | 0.7383                     | 0.7660                      | 0.6779                  | 0.7456                   | 0.8279                  | 0.4230                     |
| **Unsafe F1 Score**  | 0.8076 β                    | 0.7713                     | 0.7774                      | 0.7179                  | 0.7478                   | 0.7609                  | 0.5607                     |

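The overall F1 scores appear to be macro-averages of the per-class scores: for example, the v2 Large model's overall F1 of 0.8278 is the mean of its Safe F1 (0.8480) and Unsafe F1 (0.8076). A minimal sketch of that aggregation, assuming an unweighted mean over the two classes:

```python
def macro_f1(per_class_f1: list[float]) -> float:
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)

# v2 Large: Safe F1 = 0.8480, Unsafe F1 = 0.8076 -> overall F1 = 0.8278
```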
# Limitations

- Mezzo Prompt Guard may occasionally flag safe messages as unsafe. I recommend raising the threshold for the unsafe class to 0.7-0.8 for a lower false positive rate, or lowering it to 0.3-0.4 to catch the most prompt injections.
- More sophisticated attacks outside of its training data may bypass the model; please report examples of this in the discussions to help me improve these models!
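The threshold adjustment recommended above can be sketched as follows; the checkpoint id and label name in the comments are placeholders I've assumed for illustration, not taken from this card:

```python
def is_unsafe(unsafe_score: float, threshold: float = 0.5) -> bool:
    """Flag a prompt only when the unsafe-class score clears the threshold."""
    return unsafe_score >= threshold

# Illustrative usage with transformers (checkpoint and label names are placeholders):
#   from transformers import pipeline
#   clf = pipeline("text-classification", model="<mezzo-prompt-guard-v2-checkpoint>", top_k=None)
#   scores = {d["label"]: d["score"] for d in clf(prompt)[0]}
#   flagged = is_unsafe(scores["unsafe"], threshold=0.8)  # stricter -> lower FPR
```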