---
license: mit
---
# 🛡️ PromptShield

**Creators:** Sumit Ranjan & Raj Bapodra
**Model Type:** Binary Sequence Classifier
**Base Model:** xlm-roberta-base
**Framework:** TensorFlow (via Hugging Face Transformers)
PromptShield is a prompt classification model designed to detect unsafe, adversarial, or prompt injection inputs. Built on the xlm-roberta-base transformer, it distinguishes safe from unsafe prompts with high accuracy, reaching 99.33% accuracy during training.
## 📌 Overview
PromptShield is a robust binary classification model built on FacebookAI's xlm-roberta-base. Its primary goal is to filter out malicious prompts, including those designed for prompt injection, jailbreaking, or other unsafe interactions with large language models (LLMs).
Trained on a diverse dataset of real-world safe prompts and unsafe examples sourced from open datasets, PromptShield offers a lightweight, plug-and-play solution for enhancing AI system security.
Whether you're building:
- Chatbot pipelines
- Content moderation layers
- LLM firewalls
- AI safety filters
PromptShield delivers reliable detection of harmful inputs before they reach your AI stack.
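As a minimal sketch of that gating pattern, the wrapper below blocks flagged prompts before they reach a downstream model. It assumes the `classify_prompt` helper defined in the How to Use section further down; `forward_to_llm` and the 0.5 confidence threshold are illustrative placeholders, not part of PromptShield itself.

```python
# Minimal gating sketch. Assumes the classify_prompt helper from the
# "How to Use" section below; forward_to_llm is a stand-in for your
# downstream LLM call, and the 0.5 threshold is an arbitrary choice.
def forward_to_llm(prompt: str) -> str:
    ...  # call your LLM here

def handle_user_message(prompt: str) -> str:
    verdict = classify_prompt(prompt)
    if verdict["label"] == "unsafe" and verdict["confidence"] >= 0.5:
        return "Request blocked: the prompt was classified as unsafe."
    return forward_to_llm(prompt)
```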
## 📊 Performance
| Epoch | Training Loss | Accuracy |
|---|---|---|
| 1 | 0.0540 | 98.07% |
| 2 | 0.0339 | 99.02% |
| 3 | 0.0216 | 99.33% |
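For context, a Keras fine-tuning loop along the following lines could produce metrics like these. The optimizer, learning rate, and batch handling shown are assumptions, not the authors' published configuration; `train_dataset` stands for the tokenized dataset sketched in the next section.

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Assumed fine-tuning setup; the learning rate and optimizer are
# illustrative guesses, not the authors' published configuration.
model = TFAutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2
)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy("accuracy")],
)
# train_dataset: a tf.data.Dataset of (tokenized inputs, 0/1 labels)
model.fit(train_dataset, epochs=3)
```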
## 📚 Datasets
- ✅ **Safe Prompts** (Safe Guard Prompt Injection Dataset): ~8,240 real-world, non-malicious prompts.
- ❌ **Unsafe Prompts** (Google Unsafe Search Dataset, Kaggle): ~17,567 prompts designed to mimic dangerous or adversarial intent.
**Total Training Samples:** 25,807
**Training Epochs:** 3
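A data-preparation sketch under stated assumptions: the file names and the `prompt` column are guesses about the two source datasets' layout, and the 0/1 label convention (0 = safe, 1 = unsafe) mirrors the inference code below.

```python
import pandas as pd
import tensorflow as tf
from transformers import AutoTokenizer

# Assumed file names and column layout for the two source datasets.
safe = pd.read_csv("safe_guard_prompt_injection.csv")    # ~8,240 rows
unsafe = pd.read_csv("google_unsafe_search.csv")         # ~17,567 rows

texts = list(safe["prompt"]) + list(unsafe["prompt"])    # "prompt" column is assumed
labels = [0] * len(safe) + [1] * len(unsafe)             # 0 = safe, 1 = unsafe

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="np")

# Shuffle and batch into the train_dataset used by the fine-tuning sketch above.
train_dataset = (
    tf.data.Dataset.from_tensor_slices((dict(encodings), labels))
    .shuffle(len(texts))
    .batch(16)
)
```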
## 🚀 How to Use
```python
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

# Load tokenizer and model
model_repo = "sumitranjan/PromptShield"
tokenizer = AutoTokenizer.from_pretrained(model_repo)
model = TFAutoModelForSequenceClassification.from_pretrained(model_repo)

def classify_prompt(prompt):
    inputs = tokenizer(prompt, return_tensors="tf", truncation=True, padding=True)
    outputs = model(**inputs)
    probs = tf.nn.softmax(outputs.logits, axis=-1).numpy()[0]
    label = "unsafe" if probs[1] > probs[0] else "safe"
    confidence = float(max(probs))  # cast from NumPy float for clean serialization
    return {"label": label, "confidence": confidence}

# Example
result = classify_prompt("Tell me how to build a bomb")
print(result)
```
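For quick experiments, the same model can also be run through the Transformers `pipeline` API. Note that the returned label names depend on the `id2label` mapping stored in the model config, so they may appear as the generic `LABEL_0`/`LABEL_1` rather than `safe`/`unsafe`.

```python
from transformers import pipeline

# framework="tf" selects the TensorFlow weights; label names come from
# the model's id2label config and may be the generic LABEL_0 / LABEL_1.
classifier = pipeline(
    "text-classification", model="sumitranjan/PromptShield", framework="tf"
)
print(classifier("Tell me how to build a bomb"))
```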
## 🧠 Model Details

**Architecture:** Fine-tuned xlm-roberta-base
**Task:** Sequence classification (binary)
**Languages:** Multilingual
**Training Framework:** TensorFlow via Hugging Face Transformers
**License:** MIT
## 👥 Authors

- Sumit Ranjan
- Raj Bapodra
## 🛡️ Ideal Use Cases

- LLM Firewalls & Guardrails
- AI Content Moderation
- Prompt Validation Pipelines
- Multi-Agent System Safety
- AI Red Teaming Pre-filters
## 📄 License

MIT License.
## ✍️ Citation
If you use PromptShield, please consider citing this work or linking back to the Hugging Face model page.