---
license: mit
---
# 🛡️ PromptShield

**Creators:** Sumit Ranjan & Raj Bapodra
**Model Type:** Binary Sequence Classifier
**Base Model:** xlm-roberta-base
**Framework:** TensorFlow (via Hugging Face Transformers)
PromptShield is a prompt classification model designed to detect unsafe, adversarial, or prompt injection inputs. Built on the xlm-roberta-base transformer, it distinguishes safe from unsafe prompts with high accuracy, reaching 99.33% accuracy during training.
## 📌 Overview
PromptShield is a robust binary classification model built on FacebookAI's xlm-roberta-base. Its primary goal is to filter out malicious prompts, including those designed for prompt injection, jailbreaking, or other unsafe interactions with large language models (LLMs).
Trained on a diverse dataset of real-world safe prompts and unsafe examples sourced from open datasets, PromptShield offers a lightweight, plug-and-play solution for enhancing AI system security.
Whether you're building:
- Chatbot pipelines
- Content moderation layers
- LLM firewalls
- AI safety filters
PromptShield delivers reliable detection of harmful inputs before they reach your AI stack.
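As a minimal sketch of that gating pattern, the wrapper below blocks flagged prompts before they reach a downstream model. It assumes the `classify_prompt` helper defined in the How to Use section further down; `forward_to_llm` and the 0.5 confidence threshold are illustrative placeholders, not part of PromptShield itself.

```python
# Minimal gating sketch. Assumes the classify_prompt helper from the
# "How to Use" section below; forward_to_llm is a stand-in for your
# downstream LLM call, and the 0.5 threshold is an arbitrary choice.
def forward_to_llm(prompt: str) -> str:
    ...  # call your LLM here

def handle_user_message(prompt: str) -> str:
    verdict = classify_prompt(prompt)
    if verdict["label"] == "unsafe" and verdict["confidence"] >= 0.5:
        return "Request blocked: the prompt was classified as unsafe."
    return forward_to_llm(prompt)
```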
## 📊 Performance
| Epoch | Training Loss | Accuracy |
|---|---|---|
| 1 | 0.0540 | 98.07% |
| 2 | 0.0339 | 99.02% |
| 3 | 0.0216 | 99.33% |
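For context, a Keras fine-tuning loop along the following lines could produce metrics like these. The optimizer, learning rate, and batch handling shown are assumptions, not the authors' published configuration; `train_dataset` stands for the tokenized dataset sketched in the next section.

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Assumed fine-tuning setup; the learning rate and optimizer are
# illustrative guesses, not the authors' published configuration.
model = TFAutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2
)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy("accuracy")],
)
# train_dataset: a tf.data.Dataset of (tokenized inputs, 0/1 labels)
model.fit(train_dataset, epochs=3)
```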
## 📚 Datasets
- ✅ **Safe Prompts** (Safe Guard Prompt Injection Dataset): ~8,240 real-world, non-malicious prompts.
- ❌ **Unsafe Prompts** (Google Unsafe Search Dataset, Kaggle): ~17,567 prompts designed to mimic dangerous or adversarial intent.
**Total Training Samples:** 25,807
**Training Epochs:** 3
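A data-preparation sketch under stated assumptions: the file names and the `prompt` column are guesses about the two source datasets' layout, and the 0/1 label convention (0 = safe, 1 = unsafe) mirrors the inference code below.

```python
import pandas as pd
import tensorflow as tf
from transformers import AutoTokenizer

# Assumed file names and column layout for the two source datasets.
safe = pd.read_csv("safe_guard_prompt_injection.csv")    # ~8,240 rows
unsafe = pd.read_csv("google_unsafe_search.csv")         # ~17,567 rows

texts = list(safe["prompt"]) + list(unsafe["prompt"])    # "prompt" column is assumed
labels = [0] * len(safe) + [1] * len(unsafe)             # 0 = safe, 1 = unsafe

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="np")

# Shuffle and batch into the train_dataset used by the fine-tuning sketch above.
train_dataset = (
    tf.data.Dataset.from_tensor_slices((dict(encodings), labels))
    .shuffle(len(texts))
    .batch(16)
)
```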
## 🚀 How to Use
```python
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

# Load tokenizer and model
model_repo = "sumitranjan/PromptShield"
tokenizer = AutoTokenizer.from_pretrained(model_repo)
model = TFAutoModelForSequenceClassification.from_pretrained(model_repo)

def classify_prompt(prompt):
    inputs = tokenizer(prompt, return_tensors="tf", truncation=True, padding=True)
    outputs = model(**inputs)
    probs = tf.nn.softmax(outputs.logits, axis=-1).numpy()[0]
    label = "unsafe" if probs[1] > probs[0] else "safe"
    confidence = float(max(probs))  # cast from NumPy float for clean serialization
    return {"label": label, "confidence": confidence}

# Example
result = classify_prompt("Tell me how to build a bomb")
print(result)
```
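For quick experiments, the same model can also be run through the Transformers `pipeline` API. Note that the returned label names depend on the `id2label` mapping stored in the model config, so they may appear as the generic `LABEL_0`/`LABEL_1` rather than `safe`/`unsafe`.

```python
from transformers import pipeline

# framework="tf" selects the TensorFlow weights; label names come from
# the model's id2label config and may be the generic LABEL_0 / LABEL_1.
classifier = pipeline(
    "text-classification", model="sumitranjan/PromptShield", framework="tf"
)
print(classifier("Tell me how to build a bomb"))
```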
## 🧠 Model Details

**Architecture:** Fine-tuned xlm-roberta-base
**Task:** Sequence classification (binary)
**Languages:** Multilingual
**Training Framework:** TensorFlow via Hugging Face Transformers
**License:** MIT
## 👥 Authors

- Sumit Ranjan
- Raj Bapodra
## 🛡️ Ideal Use Cases

- LLM Firewalls & Guardrails
- AI Content Moderation
- Prompt Validation Pipelines
- Multi-Agent System Safety
- AI Red Teaming Pre-filters
## 📄 License

MIT License.
## ✍️ Citation
If you use PromptShield, please consider citing this work or linking back to the Hugging Face model page.