---
library_name: transformers
tags:
- prompt-injection
- injection-detection
- safety
license: mit
datasets:
- RyanStudio/Mezzo-Prompt-Guard-Datasets
base_model:
- microsoft/deberta-v3-base
pipeline_tag: text-classification
---

# Mezzo Prompt Guard Base Model Card
<a href="https://discord.gg/sBMqepFV6m"><img src="https://discord.com/api/guilds/1386414999932506197/embed.png" alt="Discord Link" height="20"></a>

The Mezzo Prompt Guard series aims to improve the detection of prompt injection and jailbreak attempts.

Mezzo Prompt Guard Small was distilled from Mezzo Prompt Guard Base and, in some cases, may offer better performance and lower latency.

Mezzo Prompt Guard Tiny was further distilled from Mezzo Prompt Guard Small and, in some cases, offers better performance and lower latency as well.

When deciding which model to use, I recommend Base for the most stability, Small for the best overall balance of latency and performance, and Tiny if security is your top priority.

## Model Details
### Model Description

The Mezzo Prompt Guard series uses DeBERTa-v3 models as its base models:

- [DeBERTa-v3-base](https://huggingface.co/microsoft/deberta-v3-base) for Mezzo Prompt Guard Base
- [DeBERTa-v3-small](https://huggingface.co/microsoft/deberta-v3-small) for Mezzo Prompt Guard Small
- [DeBERTa-v3-xsmall](https://huggingface.co/microsoft/deberta-v3-xsmall) for Mezzo Prompt Guard Tiny

Mezzo Prompt Guard aims to detect unsafe prompts more accurately than models such as Llama Prompt Guard 2, and in some cases it detects nearly twice as many injections.

## Usage

Mezzo Prompt Guard labels prompts as `safe` or `unsafe`. During training, safe prompts were labeled 0 and unsafe prompts were labeled 1.

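A minimal sketch of scoring prompts with the Hugging Face `transformers` library follows. It assumes the checkpoint loads with `AutoModelForSequenceClassification` and that logit index 1 corresponds to `unsafe`, matching the training labels described above. This card does not state the repository names, so `model_id` is left for you to fill in. The helper names (`unsafe_probability`, `label_for`, `score_prompts`) are illustrative, not part of any published API. The post-processing is plain Python, so you can check the thresholding logic without downloading the model.

```python
import math
from typing import List, Sequence

# Label indices as described in this card: 0 = safe, 1 = unsafe.
# Assumption: the checkpoint's logits follow the same order.
SAFE, UNSAFE = 0, 1


def unsafe_probability(logits: Sequence[float]) -> float:
    """Softmax over one [safe, unsafe] logit pair, returning P(unsafe)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[UNSAFE] / sum(exps)


def label_for(prob_unsafe: float, threshold: float = 0.5) -> str:
    """Map P(unsafe) to a label; a higher threshold flags fewer prompts."""
    return "unsafe" if prob_unsafe >= threshold else "safe"


def score_prompts(texts: List[str], model_id: str, batch_size: int = 128) -> List[float]:
    """Return P(unsafe) for each prompt using a Hugging Face checkpoint.

    Imports are local so the helpers above work without transformers installed.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    probs: List[float] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        # Same softmax as unsafe_probability, applied to the whole batch.
        probs.extend(torch.softmax(logits, dim=-1)[:, UNSAFE].tolist())
    return probs
```

For example, `[label_for(p, threshold=0.7) for p in score_prompts(prompts, model_id)]` labels a list of prompts using the stricter threshold suggested under Limitations. The default batch size of 128 mirrors the benchmark setup and can be lowered on smaller GPUs.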
# Performance Metrics
## General Stats

All tests were run on an RTX 5060 Ti (16 GB) with a batch size of 128.

| Metric | Mezzo Prompt Guard Base | Mezzo Prompt Guard Small | Mezzo Prompt Guard Tiny | Llama Prompt Guard 2 (86M) | ProtectAI DeBERTa base prompt injection v2 |
|---|---|---|---|---|---|
| Safe accuracy | 0.9093 | 0.9195 | 0.8644 | **0.9646** | 0.9214 |
| Safe recall | 0.9093 | 0.9195 | 0.8644 | **0.9646** | 0.9214 |
| Safe F1 | 0.8366 | **0.8437** | 0.8247 | 0.8004 | 0.8261 |
| Injection accuracy | 0.6742 | 0.6919 | **0.7355** | 0.4050 | 0.6213 |
| Injection recall | 0.6742 | 0.6919 | **0.7355** | 0.4050 | 0.6213 |
| Injection F1 | 0.7350 | 0.7437 | **0.7444** | 0.5239 | 0.7008 |

The best result in each row is shown in bold.

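In the table above, each class's accuracy equals its recall. This is consistent with per-class accuracy having been computed as the fraction of that class's samples that were classified correctly, which is the definition of recall. That reading is an inference from the numbers, not something confirmed by the evaluation code. The sketch below computes per-class recall, precision, and F1 from binary labels (0 = safe, 1 = unsafe), so you can reproduce these metrics on your own data. The function name `per_class_metrics` is illustrative.

```python
from typing import Dict, Sequence


def per_class_metrics(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> Dict[str, float]:
    """Recall, precision and F1 for one class (0 = safe, 1 = unsafe)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)

    # Recall: share of samples truly in `cls` that were predicted as `cls`,
    # i.e. the "per-class accuracy" for that class.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Precision: share of predictions of `cls` that were correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}
```

Calling it with `cls=0` corresponds to the Safe rows and with `cls=1` to the Injection rows. Because F1 also depends on precision, a model can lead on recall (as Tiny does for injections) while its F1 lead is much narrower.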
Overall, all Mezzo Prompt Guard models are better than Llama Prompt Guard 2 at detecting both general and more subtle prompt injections, with up to nearly 2x its injection recall (0.7355 for Mezzo Prompt Guard Tiny versus 0.4050).

Ambiguous prompts are more likely to be flagged as false positives, so it is recommended to adjust the classification threshold to suit your needs.

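One way to choose a threshold is to sweep candidate values over a labeled validation set. You can then check how injection recall trades off against the false-positive rate on safe prompts. The sketch below assumes you already have a P(unsafe) score for each prompt (for example, from a softmax over the model's logits) and ground-truth labels (0 = safe, 1 = unsafe). The function names are illustrative, not part of any library.

```python
from typing import List, Sequence, Tuple


def sweep_thresholds(
    scores: Sequence[float],
    labels: Sequence[int],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """For each threshold, return (threshold, injection recall, safe false-positive rate)."""
    n_unsafe = sum(1 for y in labels if y == 1)
    n_safe = len(labels) - n_unsafe
    results = []
    for t in thresholds:
        flagged = [s >= t for s in scores]  # True means "predicted unsafe"
        tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
        recall = tp / n_unsafe if n_unsafe else 0.0
        fpr = fp / n_safe if n_safe else 0.0
        results.append((t, recall, fpr))
    return results


def pick_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    thresholds: Sequence[float],
    max_fpr: float,
) -> float:
    """Lowest candidate threshold whose false-positive rate is within max_fpr."""
    # Sorting the tuples orders them by threshold, lowest first.
    for t, _recall, fpr in sorted(sweep_thresholds(scores, labels, thresholds)):
        if fpr <= max_fpr:
            return t
    raise ValueError("no candidate threshold meets the false-positive budget")
```

Recall can only stay the same or drop as the threshold rises. Taking the lowest threshold that fits the false-positive budget therefore keeps as much injection coverage as that budget allows. The 0.7 to 0.8 range suggested under Limitations is a sensible set of candidates to include in the sweep.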
## Model Information

- **Dataset:** Mezzo Prompt Guard was trained on a large collection of public datasets. This lets it detect well-known attack patterns as well as more modern attack methods.

# Limitations

- Mezzo Prompt Guard occasionally flags safe messages as unsafe. I recommend raising the threshold for unsafe messages to between 0.7 and 0.8 to reduce false positives, at the cost of potentially missing some injections.
- More sophisticated attacks that fall outside its training data may go undetected.
- Because the base model (DeBERTa-v3) was designed primarily for English, accuracy may be lower in multilingual contexts.