GenTelLab committed · Commit e455370 · verified · Parent(s): 9713894

Update README.md
Files changed (1): README.md (+107 −17)

---
license: apache-2.0
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
base_model: intfloat/multilingual-e5-base
pipeline_tag: text-classification
tags:
- LLM Prompt Injection Attack
- Jailbreaking Attack
- Prompt Leaking
- Goal Hijacking
---

# Model Details

The development of the GenTel-Shield detection model follows a five-step process. First, a training dataset is constructed by gathering data from online sources and expert contributions. This data then undergoes binary labeling and cleaning to ensure quality. Next, data augmentation techniques are applied to expand the dataset. A pre-trained model is then fine-tuned on this data. Finally, the trained model distinguishes between malicious and benign samples.

Below is the workflow of GenTel-Shield.

![gentel-shield](C:\Users\17391\Desktop\gitbox\jailbreaking-llms.github.io-main\jailbreaking-llms.github.io-main\static\images\gentel-shield.png)

# Training Data Preparation

### Data Collection

Our training data is drawn from two primary sources: risk data from public platforms, including websites such as jailbreakchat.com and reddit.com, and established datasets from LLM applications, such as the VMware Open-Instruct dataset and the Chatbot Instruction Prompts dataset. Domain experts have annotated these examples, categorizing the prompts into two distinct groups: harmful injection-attack samples and benign samples.

### Data Augmentation

In real-world scenarios, we have encountered adversarial samples, such as prompts with meaningless characters added or words deleted, that bypass detection by defense models and can lead to dangerous behavior. To enhance the robustness of our detection model, we apply data augmentation targeting both semantic alterations and character-level perturbations. For perturbation we employ four simple yet effective operations: synonym replacement, random insertion, random swap, and random deletion. For semantic augmentation we use LLMs to rewrite our data, generating a more diverse set of training samples.
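
The four perturbation operations above can be sketched in a few lines of plain Python. The synonym table, function names, and parameters here are illustrative assumptions, not the project's actual implementation; a real pipeline would use a proper synonym source such as WordNet.

```python
import random

# Illustrative synonym table (an assumption for this sketch); a real
# implementation would use WordNet or an embedding-based lookup.
SYNONYMS = {"ignore": ["disregard", "bypass"], "secret": ["hidden", "private"]}

def synonym_replacement(words, n=1):
    # Replace up to n words that have an entry in the synonym table.
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, n=1):
    # Insert n copies of randomly chosen existing words at random positions.
    out = list(words)
    for _ in range(n):
        out.insert(random.randrange(len(out) + 1), random.choice(words))
    return out

def random_swap(words, n=1):
    # Swap the words at two randomly chosen positions, n times.
    out = list(words)
    for _ in range(n):
        i, j = random.randrange(len(out)), random.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p=0.1):
    # Drop each word with probability p, keeping at least one word.
    out = [w for w in words if random.random() > p]
    return out or [random.choice(words)]

def augment(prompt, n_aug=4):
    # Produce n_aug perturbed variants of a prompt, one random op each.
    words = prompt.split()
    ops = [synonym_replacement, random_insertion, random_swap, random_deletion]
    return [" ".join(random.choice(ops)(words)) for _ in range(n_aug)]
```

Each operation returns a new list, so the original sample is preserved alongside its augmented variants.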

### Model Training Details

We fine-tune the GenTel-Shield model on our proposed training text-pair dataset, initializing it from the multilingual E5 text embedding model. Training is conducted on a single machine with one NVIDIA GeForce RTX 4090D (24 GB) GPU, using a batch size of 32. The model is trained with a learning rate of 2e-5, a cosine learning-rate scheduler, and a weight decay of 0.01 to mitigate overfitting. To reduce memory usage, we employ mixed-precision (fp16) training. The training process also includes a 500-step warmup phase, and we apply gradient clipping with a maximum norm of 1.0.
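
As an illustration of the schedule described above (linear warmup over 500 steps, then cosine decay from the 2e-5 peak), here is a minimal sketch. The function name and the total-step count are hypothetical; an actual run would rely on the trainer's built-in scheduler.

```python
import math

BASE_LR = 2e-5       # peak learning rate from the training setup
WARMUP_STEPS = 500   # warmup phase length from the training setup

def lr_at_step(step, total_steps):
    """Linear warmup to BASE_LR, then cosine decay toward zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The learning rate rises linearly to 2e-5 at step 500 and then follows a half-cosine down to zero at the final step.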

# Evaluation

### Dataset

Gentel-Bench provides a comprehensive framework for evaluating the robustness of models against a wide range of injection attacks. The benign data in Gentel-Bench closely mirrors typical LLM usage, categorized into ten application scenarios. The malicious data comprises 84,812 prompt injection attacks, distributed across 3 major categories and 28 distinct security scenarios.

### Gentel-Bench

We evaluate the model's effectiveness in detecting Jailbreak, Goal Hijacking, and Prompt Leaking attacks on Gentel-Bench. The results demonstrate that our approach outperforms existing methods in most scenarios, particularly in accuracy and F1 score.
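
For reference, the four reported metrics follow their standard definitions and can be computed from binary predictions as in this minimal sketch; the function name and the label convention (1 = malicious, 0 = benign) are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    # Confusion counts over the positive class (1 = malicious prompt).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```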

#### Classification Performance on Jailbreak Attack Scenarios

| Method               | Accuracy ↑ | Precision ↑ | F1 ↑  | Recall ↑ |
| -------------------- | :--------- | :---------- | :---- | :------- |
| ProtectAI            | 89.46      | 99.59       | 88.62 | 79.83    |
| Hyperion             | 94.70      | 94.21       | 94.88 | 95.57    |
| Prompt Guard         | 50.58      | 51.03       | 66.85 | 96.88    |
| Lakera AI            | 87.20      | 92.12       | 86.84 | 82.14    |
| Deepset              | 65.69      | 60.63       | 75.49 | 100      |
| Fmops                | 63.35      | 59.04       | 74.25 | 100      |
| WhyLabs LangKit      | 78.86      | 98.48       | 75.28 | 60.92    |
| GenTel-Shield (Ours) | 97.63      | 98.04       | 97.69 | 97.34    |
56
+
57
+
58
+
59
+ #### Classification performance on Goal Hijacking Attack Scenarios.
60
+
61
+
62
+ | Method | Accuracy ↑ | Precision ↑ | F1 ↑ | Recall ↑ |
63
+ | ------------------- | :--------- | :---------- | :---- | :------- |
64
+ | ProtectAI | 94.25 | 99.79 | 93.95 | 88.76 |
65
+ | Hyperion | 90.68 | 94.53 | 90.33 | 86.48 |
66
+ | Prompt Guard | 50.90 | 50.61 | 67.21 | 100 |
67
+ | Lakera AI | 74.63 | 88.59 | 69.33 | 56.95 |
68
+ | Deepset | 63.40 | 57.90 | 73.34 | 100 |
69
+ | Fmops | 61.03 | 56.36 | 72.09 | 100 |
70
+ | WhyLabs LangKit | 68.14 | 97.53 | 54.35 | 37.67 |
71
+ | GenTel-Shield(Ours) | 96.81 | 99.44 | 96.74 | 94.19 |
72
+
73
+
74
+
75
+ #### Classification Performance on Prompt Leaking Attack Scenarios.
76
+
77
+
78
+ | Method | Accuracy ↑ | Precision ↑ | F1 ↑ | Recall ↑ |
79
+ | ------------------- | :--------- | :---------- | :---- | :------- |
80
+ | ProtectAI | 90.94 | 99.77 | 90.06 | 82.08 |
81
+ | Hyperion | 90.85 | 95.01 | 90.41 | 86.23 |
82
+ | Prompt Guard | 50.28 | 50.14 | 66.79 | 100 |
83
+ | Lakera AI | 96.04 | 93.11 | 96.17 | 99.43 |
84
+ | Deepset | 61.79 | 57.08 | 71.34 | 95.09 |
85
+ | Fmops | 58.77 | 55.07 | 69.80 | 95.28 |
86
+ | WhyLabs LangKit | 99.34 | 99.62 | 99.34 | 99.06 |
87
+ | GenTel-Shield(Ours) | 97.92 | 99.42 | 97.89 | 96.42 |
88
+
89
+
90
+
91
+ #### Subdivision Scenarios
92
+
93
+ ![fig_3](C:\Users\17391\Desktop\gitbox\jailbreaking-llms.github.io-main\jailbreaking-llms.github.io-main\static\images\fig_3.png)