GenTelLab committed · Commit e455370 · verified · Parent(s): 9713894

Update README.md
Files changed (1): README.md (+107 −17)

---
license: apache-2.0
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
base_model: intfloat/multilingual-e5-base
pipeline_tag: text-classification
tags:
- LLM Prompt Injection Attack
- Jailbreaking Attack
- Prompt Leaking
- Goal Hijacking
---

# Model Details

The development of the GenTel-Shield detection model follows a five-step process. First, a training dataset is constructed by gathering data from online sources and expert contributions. This data then undergoes binary labeling and cleaning to ensure quality. Next, data augmentation techniques are applied to expand the dataset. A pre-trained model is then fine-tuned on this data. Finally, the trained model distinguishes between malicious and benign samples.

Below is the workflow of GenTel-Shield.

![gentel-shield](C:\Users\17391\Desktop\gitbox\jailbreaking-llms.github.io-main\jailbreaking-llms.github.io-main\static\images\gentel-shield.png)

# Training Data Preparation

### Data Collection

Our training data is drawn from two primary sources: risk data from public platforms, including websites such as jailbreakchat.com and reddit.com, and established datasets from LLM applications, such as the VMware Open-Instruct dataset and the Chatbot Instruction Prompts dataset. Domain experts have annotated these examples, categorizing the prompts into two distinct groups: harmful injection-attack samples and benign samples.

### Data Augmentation

In real-world scenarios, we have encountered adversarial samples, such as prompts with meaningless characters added or words deleted, that bypass detection by defense models and can lead to dangerous behavior. To enhance the robustness of our detection model, we apply data augmentation targeting both semantic alterations and character-level perturbations. For perturbation we employ four simple yet effective operations: synonym replacement, random insertion, random swap, and random deletion. For semantic augmentation we use LLMs to rewrite our data, generating a more diverse set of training samples.
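
The four perturbation operations above can be sketched in a few lines of plain Python. The synonym table, function names, and parameters here are illustrative assumptions, not the project's actual implementation; a real pipeline would use a proper synonym source such as WordNet.

```python
import random

# Illustrative synonym table (an assumption for this sketch); a real
# implementation would use WordNet or an embedding-based lookup.
SYNONYMS = {"ignore": ["disregard", "bypass"], "secret": ["hidden", "private"]}

def synonym_replacement(words, n=1):
    # Replace up to n words that have an entry in the synonym table.
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, n=1):
    # Insert n copies of randomly chosen existing words at random positions.
    out = list(words)
    for _ in range(n):
        out.insert(random.randrange(len(out) + 1), random.choice(words))
    return out

def random_swap(words, n=1):
    # Swap the words at two randomly chosen positions, n times.
    out = list(words)
    for _ in range(n):
        i, j = random.randrange(len(out)), random.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p=0.1):
    # Drop each word with probability p, keeping at least one word.
    out = [w for w in words if random.random() > p]
    return out or [random.choice(words)]

def augment(prompt, n_aug=4):
    # Produce n_aug perturbed variants of a prompt, one random op each.
    words = prompt.split()
    ops = [synonym_replacement, random_insertion, random_swap, random_deletion]
    return [" ".join(random.choice(ops)(words)) for _ in range(n_aug)]
```

Each operation returns a new list, so the original sample is preserved alongside its augmented variants.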

### Model Training Details

We fine-tune the GenTel-Shield model on our proposed training text-pair dataset, initializing it from the multilingual E5 text embedding model. Training is conducted on a single machine with one NVIDIA GeForce RTX 4090D (24 GB) GPU, using a batch size of 32. The model is trained with a learning rate of 2e-5, a cosine learning-rate scheduler, and a weight decay of 0.01 to mitigate overfitting. To reduce memory usage, we employ mixed-precision (fp16) training. The training process also includes a 500-step warmup phase, and we apply gradient clipping with a maximum norm of 1.0.
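
As an illustration of the schedule described above (linear warmup over 500 steps, then cosine decay from the 2e-5 peak), here is a minimal sketch. The function name and the total-step count are hypothetical; an actual run would rely on the trainer's built-in scheduler.

```python
import math

BASE_LR = 2e-5       # peak learning rate from the training setup
WARMUP_STEPS = 500   # warmup phase length from the training setup

def lr_at_step(step, total_steps):
    """Linear warmup to BASE_LR, then cosine decay toward zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The learning rate rises linearly to 2e-5 at step 500 and then follows a half-cosine down to zero at the final step.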

# Evaluation

### Dataset

Gentel-Bench provides a comprehensive framework for evaluating the robustness of models against a wide range of injection attacks. The benign data in Gentel-Bench closely mirrors typical LLM usage, categorized into ten application scenarios. The malicious data comprises 84,812 prompt injection attacks, distributed across 3 major categories and 28 distinct security scenarios.

### Gentel-Bench

We evaluate the model's effectiveness in detecting Jailbreak, Goal Hijacking, and Prompt Leaking attacks on Gentel-Bench. The results demonstrate that our approach outperforms existing methods in most scenarios, particularly in accuracy and F1 score.
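
For reference, the four reported metrics follow their standard definitions and can be computed from binary predictions as in this minimal sketch; the function name and the label convention (1 = malicious, 0 = benign) are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    # Confusion counts over the positive class (1 = malicious prompt).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```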

#### Classification Performance on Jailbreak Attack Scenarios

| Method               | Accuracy ↑ | Precision ↑ | F1 ↑  | Recall ↑ |
| -------------------- | :--------- | :---------- | :---- | :------- |
| ProtectAI            | 89.46      | 99.59       | 88.62 | 79.83    |
| Hyperion             | 94.70      | 94.21       | 94.88 | 95.57    |
| Prompt Guard         | 50.58      | 51.03       | 66.85 | 96.88    |
| Lakera AI            | 87.20      | 92.12       | 86.84 | 82.14    |
| Deepset              | 65.69      | 60.63       | 75.49 | 100      |
| Fmops                | 63.35      | 59.04       | 74.25 | 100      |
| WhyLabs LangKit      | 78.86      | 98.48       | 75.28 | 60.92    |
| GenTel-Shield (Ours) | 97.63      | 98.04       | 97.69 | 97.34    |
56
+
57
+
58
+
59
+ #### Classification performance on Goal Hijacking Attack Scenarios.
60
+
61
+
62
+ | Method | Accuracy ↑ | Precision ↑ | F1 ↑ | Recall ↑ |
63
+ | ------------------- | :--------- | :---------- | :---- | :------- |
64
+ | ProtectAI | 94.25 | 99.79 | 93.95 | 88.76 |
65
+ | Hyperion | 90.68 | 94.53 | 90.33 | 86.48 |
66
+ | Prompt Guard | 50.90 | 50.61 | 67.21 | 100 |
67
+ | Lakera AI | 74.63 | 88.59 | 69.33 | 56.95 |
68
+ | Deepset | 63.40 | 57.90 | 73.34 | 100 |
69
+ | Fmops | 61.03 | 56.36 | 72.09 | 100 |
70
+ | WhyLabs LangKit | 68.14 | 97.53 | 54.35 | 37.67 |
71
+ | GenTel-Shield(Ours) | 96.81 | 99.44 | 96.74 | 94.19 |
72
+
73
+
74
+
75
+ #### Classification Performance on Prompt Leaking Attack Scenarios.
76
+
77
+
78
+ | Method | Accuracy ↑ | Precision ↑ | F1 ↑ | Recall ↑ |
79
+ | ------------------- | :--------- | :---------- | :---- | :------- |
80
+ | ProtectAI | 90.94 | 99.77 | 90.06 | 82.08 |
81
+ | Hyperion | 90.85 | 95.01 | 90.41 | 86.23 |
82
+ | Prompt Guard | 50.28 | 50.14 | 66.79 | 100 |
83
+ | Lakera AI | 96.04 | 93.11 | 96.17 | 99.43 |
84
+ | Deepset | 61.79 | 57.08 | 71.34 | 95.09 |
85
+ | Fmops | 58.77 | 55.07 | 69.80 | 95.28 |
86
+ | WhyLabs LangKit | 99.34 | 99.62 | 99.34 | 99.06 |
87
+ | GenTel-Shield(Ours) | 97.92 | 99.42 | 97.89 | 96.42 |
88
+
89
+
90
+
91
+ #### Subdivision Scenarios
92
+
93
+ ![fig_3](C:\Users\17391\Desktop\gitbox\jailbreaking-llms.github.io-main\jailbreaking-llms.github.io-main\static\images\fig_3.png)