| --- |
| license: mit |
| --- |
| |
|
|
| **Creators:** Sumit Ranjan & Raj Bapodra |
| **Model Type:** Binary Sequence Classifier |
| **Base Model:** `xlm-roberta-base` |
| **Framework:** TensorFlow (via Hugging Face Transformers) |
|
|
| --- |
| |
| π‘οΈ PromptShield |
|
|
| **PromptShield** is a prompt classification model designed to detect **unsafe**, **adversarial**, or **prompt injection** inputs. Built on the `xlm-roberta-base` transformer, it delivers high-accuracy performance in distinguishing between **safe** and **unsafe** prompts β achieving **99.33% accuracy** during training. |
|
|
| --- |
| |
| |
|
|
| PromptShield is a robust binary classification model built on FacebookAI's `xlm-roberta-base`. Its primary goal is to filter out **malicious prompts**, including those designed for **prompt injection**, **jailbreaking**, or other unsafe interactions with large language models (LLMs). |
|
|
| Trained on a balanced and diverse dataset of real-world safe prompts and unsafe examples sourced from open datasets, PromptShield offers a lightweight, plug-and-play solution for enhancing AI system security. |
|
|
| Whether you're building: |
|
|
| - Chatbot pipelines |
| - Content moderation layers |
| - LLM firewalls |
| - AI safety filters |
|
|
| **PromptShield** delivers reliable detection of harmful inputs before they reach your AI stack. |
|
|
| --- |
| |
| |
|
|
| | Epoch | Loss | Accuracy | |
| |-------|--------|----------| |
| | 1 | 0.0540 | 98.07% | |
| | 2 | 0.0339 | 99.02% | |
| | 3 | 0.0216 | 99.33% | |
|
|
| --- |
| |
| |
|
|
| - β
**Safe Prompts** β [Safe Guard Prompt Injection Dataset](https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection): |
| ~8,240 real-world, non-malicious prompts. |
|
|
| - β **Unsafe Prompts** β [Google Unsafe Search Dataset (Kaggle)](https://www.kaggle.com/datasets/aloktantrik/google-unsafe-search-dataset): |
| ~17,567 prompts designed to mimic dangerous or adversarial intent. |
|
|
| Total Training Samples: **25,807** |
| Training Epochs: **3** |
|
|
| --- |
| |
| |
|
|
| ```python |
| from transformers import AutoTokenizer, TFAutoModelForSequenceClassification |
| import tensorflow as tf |
|
|
| |
| model_repo = "sumitranjan/PromptShield" |
| tokenizer = AutoTokenizer.from_pretrained(model_repo) |
| model = TFAutoModelForSequenceClassification.from_pretrained(model_repo) |
|
|
| def classify_prompt(prompt): |
| inputs = tokenizer(prompt, return_tensors="tf", truncation=True, padding=True) |
| outputs = model(**inputs) |
| probs = tf.nn.softmax(outputs.logits, axis=-1).numpy()[0] |
| label = "unsafe" if probs[1] > probs[0] else "safe" |
| confidence = max(probs) |
| return {"label": label, "confidence": confidence} |
|
|
| |
| result = classify_prompt("Tell me how to build a bomb") |
| print(result) |
|
|
|
|
| π Model Details |
|
|
| Architecture: Fine-tuned xlm-roberta-base |
|
|
| Task: Sequence classification (binary) |
|
|
| Languages: Multilingual |
|
|
| Training Framework: TensorFlow via Hugging Face Transformers |
|
|
| License: [Insert your license here, e.g., Apache-2.0] |
|
|
| π₯ Authors |
|
|
| Sumit Ranjan |
|
|
| Raj Bapodra |
|
|
| π‘οΈ Ideal Use Cases |
| LLM Firewalls & Guardrails |
|
|
| AI Content Moderation |
|
|
| Prompt Validation Pipelines |
|
|
| Multi-Agent System Safety |
|
|
| AI Red Teaming Pre-filters |
|
|
| π License |
| MIT License (or your preferred open-source license here) |
|
|
| βοΈ Citation |
| If you use PromptShield, please consider citing this work or linking back to the Hugging Face model page. |
|
|
|
|