Instructions to use xTRam1/safe-guard-classifier with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use xTRam1/safe-guard-classifier with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-classification", model="xTRam1/safe-guard-classifier")# Load model directly from transformers import AutoTokenizer, AutoModelForSequenceClassification tokenizer = AutoTokenizer.from_pretrained("xTRam1/safe-guard-classifier") model = AutoModelForSequenceClassification.from_pretrained("xTRam1/safe-guard-classifier") - Notebooks
- Google Colab
- Kaggle
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("xTRam1/safe-guard-classifier")
model = AutoModelForSequenceClassification.from_pretrained("xTRam1/safe-guard-classifier")We formulated the prompt injection detector problem as a classification problem and trained our own language model to detect whether a given user prompt is an attack or safe. First, to train our own prompt injection detector, we required high-quality labelled data; however, existing prompt injection datasets were either too small (on the magnitude of O(100)) or didn’t cover a broad spectrum of prompt injection attacks. To this end, inspired by the GLAN paper, we created a custom synthetic prompt injection dataset using a categorical tree structure and generated 3000 distinct attacks. We started by curating our seed data using open-source datasets (vmware/open-instruct, huggingfaceh4/helpful-instructions, Fka-awesome-chatgpt-prompts, jackhhao/jailbreak-classification). Then we identified various prompt objection categories (context manipulation, social engineering, ignore prompt, fake completion…) and prompted GPT-3.5-turbo in a categorical tree structure to generate prompt injection attacks for every category. Our final custom dataset consisted of 7000 positive/safe prompts and 3000 injection prompts. We also curated a test set of size 600 prompts following the same approach. Using our custom dataset, we fine-tuned DeBERTa-v3-small from scratch. We compared our model’s performance to the best-performing prompt injection classifier from ProtectAI and observed a 4.9% accuracy increase on our held-out test data. Specifically, our custom model achieved an accuracy of 99.6%, compared to the 94.7% accuracy of ProtecAI’s model, all the while being 2X smaller (44M (ours) vs. 86M (theirs)).
Team:
Lutfi Eren Erdogan (lerdogan@berkeley.edu)
Chuyi Shang (chuyishang@berkeley.edu)
Aryan Goyal (aryangoyal@berkeley.edu)
Siddarth Ijju (sidijju@berkeley.edu)
Links
- Downloads last month
- 288
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-classification", model="xTRam1/safe-guard-classifier")