---
license: mit
language:
- en
- tr
- zh
- hi
- de
- fr
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
tags:
- prompt
- safety
- prompt injections
- prompt guard
- guard
- classification
---

# Mezzo Prompt Guard v2 Series
<a href="https://discord.gg/sBMqepFV6m"><img src="https://discord.com/api/guilds/1386414999932506197/embed.png" alt="Discord Link" height="20"></a>

Mezzo Prompt Guard v2 is the second generation of Prompt Guard models, offering significant improvements over the previous generation, such as:

- Multilingual capabilities
- Decreased latency
- Increased accuracy and precision
- Lower false positive and false negative rates

# Model Info

## Base Model

- While our v1 models, like most prompt guard models, were built on DeBERTa v3, I switched to RoBERTa after noticing significant performance increases.
- I landed on xlm-roberta-large and xlm-roberta-base for the Mezzo Prompt Guard v2 Large and Base models, and distilbert-base-multilingual-cased for the Small model; these models offer significant improvements in multilingual performance compared to mDeBERTa.

## Training Data

- More general instruction and conversational data was added to decrease the false positive rate compared to v1.
- More examples from multilingual datasets were added to improve the model's multilingual capabilities.

## Training

- Training was done with a max sequence length of 256; performance may degrade on prompts that exceed this, so it's recommended to chunk prompts into segments of 256 tokens.
- The Large model was trained on a dataset of 200k examples and was distilled into both the Base and Small models.

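A minimal sketch of the recommended 256-token chunking, assuming you already have the prompt's token ids from the model's tokenizer (the Hugging Face calls in the comment are illustrative, and the checkpoint name is a placeholder):

```python
def chunk_token_ids(token_ids: list[int], max_tokens: int = 256) -> list[list[int]]:
    """Split a token-id sequence into consecutive chunks of at most max_tokens ids."""
    return [token_ids[i : i + max_tokens] for i in range(0, len(token_ids), max_tokens)]

# Illustrative usage with a Hugging Face tokenizer:
#   ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
#   chunks = [tokenizer.decode(c) for c in chunk_token_ids(ids)]
#   # classify each chunk separately and flag the prompt if any chunk is unsafe
```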
## Benchmarks

β marks the best score in each row.

### Overall

| Metric    | Mezzo Prompt Guard v2 Large | Mezzo Prompt Guard v2 Base | Mezzo Prompt Guard v2 Small | Mezzo Prompt Guard Base | Mezzo Prompt Guard Small | Mezzo Prompt Guard Tiny | Llama Prompt Guard 2 (86M) |
| --------- | --------------------------- | -------------------------- | --------------------------- | ----------------------- | ------------------------ | ----------------------- | -------------------------- |
| Precision | 0.8271 β                    | 0.8211                     | 0.8180                      | 0.7815                  | 0.7905                   | 0.7869                  | 0.7708                     |
| Recall    | 0.8403 β                    | 0.8104                     | 0.8147                      | 0.7687                  | 0.7899                   | 0.7978                  | 0.6829                     |
| F1 Score  | 0.8278 β                    | 0.8147                     | 0.8162                      | 0.7733                  | 0.7902                   | 0.7882                  | 0.6854                     |
| ROC AUC   | 0.9192                      | 0.9200 β                   | 0.9087                      | 0.8774                  | 0.8882                   | 0.8619                  | 0.8744                     |

### F1 Score per Benchmark Dataset

| Dataset                                    | Mezzo Prompt Guard v2 Large | Mezzo Prompt Guard v2 Base | Mezzo Prompt Guard v2 Small | Mezzo Prompt Guard Base | Mezzo Prompt Guard Small | Mezzo Prompt Guard Tiny | Llama Prompt Guard 2 (86M) |
| ------------------------------------------ | --------------------------- | -------------------------- | --------------------------- | ----------------------- | ------------------------ | ----------------------- | -------------------------- |
| beratcmn/turkish-prompt-injections         | 0.9369 β                    | 0.9369 β                   | 0.8440                      | 0.6667                  | 0.6567                   | 0.7030                  | 0.1270                     |
| deepset/prompt-injections                  | 0.8785 β                    | 0.7755                     | 0.6813                      | 0.6022                  | 0.5412                   | 0.5556                  | 0.2353                     |
| rikka-snow/prompt-injection-multilingual   | 0.9135                      | 0.9148 β                   | 0.8789                      | 0.7536                  | 0.6993                   | 0.7003                  | 0.1793                     |
| rogue-security/prompt-injections-benchmark | 0.7269                      | 0.6515                     | 0.6888                      | 0.6231                  | 0.6970                   | 0.7287 β                | 0.6238                     |
| xTRam1/safe-guard-prompt-injection         | 0.9899 β                    | 0.9750                     | 0.9482                      | 0.9525                  | 0.9769                   | 0.9542                  | 0.6782                     |

### Specific Benchmarks

| Metric               | Mezzo Prompt Guard v2 Large | Mezzo Prompt Guard v2 Base | Mezzo Prompt Guard v2 Small | Mezzo Prompt Guard Base | Mezzo Prompt Guard Small | Mezzo Prompt Guard Tiny | Llama Prompt Guard 2 (86M) |
| -------------------- | --------------------------- | -------------------------- | --------------------------- | ----------------------- | ------------------------ | ----------------------- | -------------------------- |
| **Safe Precision**   | 0.9156 β                    | 0.8349                     | 0.8469                      | 0.8000                  | 0.8310                   | 0.8699                  | 0.7101                     |
| **Safe Recall**      | 0.7896                      | 0.8825                     | 0.8634                      | 0.8595                  | 0.8342                   | 0.7677                  | 0.9428 β                   |
| **Safe F1 Score**    | 0.8480                      | 0.8580 β                   | 0.8551                      | 0.8287                  | 0.8326                   | 0.8156                  | 0.8101                     |
| **Unsafe Precision** | 0.7386                      | 0.8073                     | 0.7891                      | 0.7630                  | 0.7500                   | 0.7039                  | 0.8314 β                   |
| **Unsafe Recall**    | 0.8909 β                    | 0.7383                     | 0.7660                      | 0.6779                  | 0.7456                   | 0.8279                  | 0.4230                     |
| **Unsafe F1 Score**  | 0.8076 β                    | 0.7713                     | 0.7774                      | 0.7179                  | 0.7478                   | 0.7609                  | 0.5607                     |

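The overall F1 scores appear to be macro-averages of the per-class scores: for example, the v2 Large model's overall F1 of 0.8278 is the mean of its Safe F1 (0.8480) and Unsafe F1 (0.8076). A minimal sketch of that aggregation, assuming an unweighted mean over the two classes:

```python
def macro_f1(per_class_f1: list[float]) -> float:
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)

# v2 Large: Safe F1 = 0.8480, Unsafe F1 = 0.8076 -> overall F1 = 0.8278
```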
# Limitations

- Mezzo Prompt Guard may occasionally flag safe messages as unsafe. I recommend raising the threshold for the unsafe class to 0.7-0.8 for a lower false positive rate, or lowering it to 0.3-0.4 to catch the most prompt injections.
- More sophisticated attacks outside of its training data may bypass the model; please report examples of this in the discussions to help me improve these models!
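The threshold adjustment recommended above can be sketched as follows; the checkpoint id and label name in the comments are placeholders I've assumed for illustration, not taken from this card:

```python
def is_unsafe(unsafe_score: float, threshold: float = 0.5) -> bool:
    """Flag a prompt only when the unsafe-class score clears the threshold."""
    return unsafe_score >= threshold

# Illustrative usage with transformers (checkpoint and label names are placeholders):
#   from transformers import pipeline
#   clf = pipeline("text-classification", model="<mezzo-prompt-guard-v2-checkpoint>", top_k=None)
#   scores = {d["label"]: d["score"] for d in clf(prompt)[0]}
#   flagged = is_unsafe(scores["unsafe"], threshold=0.8)  # stricter -> lower FPR
```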