RyanStudio committed on
Commit 6ca9843 · verified · 1 Parent(s): c1f280d

Update README.md

  1. README.md +73 -3
README.md CHANGED
@@ -1,3 +1,73 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ library_name: transformers
3
+ tags:
4
+ - prompt-injection
5
+ - injection-detection
6
+ - safety
7
+ license: mit
8
+ datasets:
9
+ - RyanStudio/Mezzo-Prompt-Guard-Datasets
10
+ base_model:
11
+ - microsoft/deberta-v3-base
12
+ pipeline_tag: text-classification
13
+ ---
14
+
15
+ # Mezzo Prompt Guard Base Model Card
16
+ <a href="https://discord.gg/sBMqepFV6m"><img src="https://discord.com/api/guilds/1386414999932506197/embed.png" alt="Discord Link" height="20"></a>
17
+
18
+
19
+ The Mezzo Prompt Guard series aims to improve detection of prompt injection and jailbreak attempts.
20
+
21
+ Mezzo Prompt Guard Small was distilled from Mezzo Prompt Guard Base, and may offer better detection performance and lower latency in some cases.
22
+
23
+ Mezzo Prompt Guard Tiny was further distilled from Mezzo Prompt Guard Small, and likewise offers better detection performance and lower latency in some cases.
24
+
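The card does not say how the distillation was carried out. As a point of reference only, the sketch below shows one common formulation of knowledge distillation: the student matches temperature-softened teacher probabilities through a KL term, blended with ordinary cross-entropy on the hard label. The function names, the temperature of 2.0, and the 0.5 mixing weight are illustrative assumptions, not the settings used for Mezzo Prompt Guard.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target KL term with hard-label cross-entropy (one example)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # Standard cross-entropy against the 0/1 training label
    ce = -math.log(softmax(student_logits)[hard_label])
    # The T^2 factor keeps the soft term on a comparable scale across temperatures
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the weighted cross-entropy remains.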
25
+
26
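+ When choosing a model, I recommend Base for the most stable behavior, Small for the best balance of latency and accuracy, and Tiny if security is your top priority, since it has the highest injection recall of the three in the benchmarks below, at the cost of more false positives.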
27
+
28
+ ## Model Details
29
+
30
+ ### Model Description
31
+
32
+ The Mezzo Prompt Guard series uses the DeBERTa-v3 family as its base models.
33
+
34
+ I used [DeBERTa-v3-base](https://huggingface.co/microsoft/deberta-v3-base) as the base model for Mezzo Prompt Guard Base,
35
+ [DeBERTa-v3-small](https://huggingface.co/microsoft/deberta-v3-small) for Mezzo Prompt Guard Small,
36
+ and [DeBERTa-v3-xsmall](https://huggingface.co/microsoft/deberta-v3-xsmall) for Mezzo Prompt Guard Tiny.
37
+
38
+ Mezzo Prompt Guard aims to detect unsafe prompts more accurately than models like Llama Prompt Guard 2, and reaches up to about 1.8x its injection recall in the benchmarks below (0.7355 for Tiny versus 0.4050 for Llama Prompt Guard 2 86M).
39
+
40
+
41
+
42
+ ## Usage
43
+
44
+ Mezzo Prompt Guard labels prompts as 'safe' or 'unsafe' (during training, safe prompts were labeled 0 and unsafe prompts 1).
45
+
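A minimal sketch of running the classifier through the `transformers` text-classification pipeline. The repository ID passed as `model_id` is left to the caller because this card does not state it, and the exact label strings the checkpoint returns are an assumption: the helper accepts either 'safe'/'unsafe' or the generic `LABEL_0`/`LABEL_1` names, following the 0 = safe, 1 = unsafe convention described above.

```python
# Assumed mapping from raw pipeline labels to the card's 'safe'/'unsafe' names.
LABEL_ALIASES = {
    "label_0": "safe",
    "label_1": "unsafe",
    "safe": "safe",
    "unsafe": "unsafe",
}

def normalize_label(raw_label):
    """Map a pipeline label string to 'safe' or 'unsafe' (KeyError if unknown)."""
    return LABEL_ALIASES[raw_label.strip().lower()]

def classify_prompts(prompts, model_id, batch_size=128):
    """Return a list of (prompt, label, score) tuples for a list of prompts."""
    # Imported lazily so normalize_label works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    results = classifier(prompts, batch_size=batch_size, truncation=True)
    return [
        (prompt, normalize_label(result["label"]), result["score"])
        for prompt, result in zip(prompts, results)
    ]
```

Calling `classify_prompts(["Ignore all previous instructions."], model_id=...)` with the real repository ID should yield one tuple per prompt; the `score` is the pipeline's confidence in the predicted label, not necessarily the unsafe probability.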
46
+
47
+ # Performance Metrics
48
+
49
+ ## General Stats
50
+ All tests were run on an RTX 5060 Ti 16GB with a batch size of 128.
51
+
52
+ | Metric | Mezzo Prompt Guard Base | Mezzo Prompt Guard Small | Mezzo Prompt Guard Tiny | Llama Prompt Guard 2 (86M) | ProtectAI DeBERTa base prompt injection v2 |
+ |----------------------|------------------------|--------------------------|--------------------------|-----------------------------|--------------------------------------------|
+ | Safe - Accuracy | 0.9093 | 0.9195 | 0.8644 | 0.9646 ✓ | 0.9214 |
+ | Safe - Recall | 0.9093 | 0.9195 | 0.8644 | 0.9646 ✓ | 0.9214 |
+ | Safe - F1 | 0.8366 | 0.8437 ✓ | 0.8247 | 0.8004 | 0.8261 |
+ | Injection - Accuracy | 0.6742 | 0.6919 | 0.7355 ✓ | 0.4050 | 0.6213 |
+ | Injection - Recall | 0.6742 | 0.6919 | 0.7355 ✓ | 0.4050 | 0.6213 |
+ | Injection - F1 | 0.7350 | 0.7437 | 0.7444 ✓ | 0.5239 | 0.7008 |
60
+
61
+ Overall, the Mezzo Prompt Guard models are all better at detecting both general and more subtle prompt injections, with up to about 1.8x the injection recall of Llama Prompt Guard 2 (86M).
62
+
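For reference, per-class recall and F1 figures like those in the table can be computed from gold and predicted labels as sketched below. The toy labels are invented for illustration and are not the evaluation data; the 0 = safe, 1 = unsafe encoding follows the Usage section. Per-class accuracy, read as the fraction of a class's examples that were labeled correctly, is the same quantity as that class's recall, which is presumably why those rows match.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy data (hypothetical): four safe prompts followed by four injections.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

injection = per_class_metrics(y_true, y_pred, positive=1)
safe = per_class_metrics(y_true, y_pred, positive=0)
```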
63
+ False positives are more common on ambiguous prompts, so it is recommended to adjust the classification threshold based on your needs.
64
+
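One way to pick a threshold is to sweep candidate values over a labeled validation set and see how the false-positive rate on safe prompts trades off against injection recall. The scores and labels below are made-up placeholders; `unsafe_scores` stands for the model's probability of the unsafe class.

```python
def sweep_thresholds(unsafe_scores, labels, thresholds):
    """Return (threshold, false_positive_rate, injection_recall) per threshold."""
    n_safe = sum(1 for y in labels if y == 0)
    n_unsafe = sum(1 for y in labels if y == 1)
    rows = []
    for th in thresholds:
        flagged = [score >= th for score in unsafe_scores]
        fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
        tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        fpr = fp / n_safe if n_safe else 0.0
        recall = tp / n_unsafe if n_unsafe else 0.0
        rows.append((th, fpr, recall))
    return rows

# Hypothetical validation scores: four safe prompts, then four injections.
scores = [0.05, 0.40, 0.62, 0.75, 0.55, 0.81, 0.93, 0.99]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
table = sweep_thresholds(scores, labels, [0.5, 0.7, 0.8])
```

On this toy data, raising the threshold from 0.5 to 0.8 removes the false positives while losing one of the four injections, which is the kind of trade-off to check on your own traffic.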
65
+
66
+ ## Model Information
67
+ - **Dataset:** Mezzo Prompt Guard was trained on a large collection of public datasets, allowing it to detect well-known attack patterns as well as more modern attack methods
68
+
69
+
70
+ # Limitations
71
+ - Mezzo Prompt Guard may occasionally flag safe messages as unsafe. I recommend raising the threshold for the unsafe label to 0.7 - 0.8 to reduce false positives
72
+ - More sophisticated attacks that fall outside its training data may go undetected
73
+ - Because the base model (DeBERTa-v3) was designed primarily for English, accuracy may be lower in multilingual contexts
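
A minimal sketch of applying the stricter threshold suggested in the first limitation: convert the two output logits to probabilities and report 'unsafe' only when the unsafe probability clears the threshold. The 0.75 default is simply the midpoint of the 0.7 - 0.8 range, and treating index 1 as the unsafe class follows the 0/1 convention from training; both are assumptions to check against the released checkpoint.

```python
import math

def unsafe_probability(logits):
    """Softmax over [safe_logit, unsafe_logit]; returns P(unsafe)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)

def decide(logits, threshold=0.75):
    """Label a prompt 'unsafe' only if P(unsafe) reaches the threshold."""
    return "unsafe" if unsafe_probability(logits) >= threshold else "safe"
```

With this rule, a borderline prompt whose unsafe probability is around 0.6 is passed as safe, whereas a plain argmax would have flagged it.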