---
license: mit
---

# 🛡️ PromptShield

**Creators:** Sumit Ranjan & Raj Bapodra
**Model Type:** Binary Sequence Classifier
**Base Model:** `xlm-roberta-base`
**Framework:** TensorFlow (via Hugging Face Transformers)

---

## 📌 Overview

PromptShield is a multilingual prompt classification model designed to detect **unsafe or adversarial prompts**, particularly those that may lead to prompt injection or misuse of generative AI systems. It distinguishes between *safe* and *unsafe* inputs with over **99% accuracy**.

Whether you're deploying a chatbot, a content moderation tool, or an LLM firewall, PromptShield provides an essential layer of safety assurance.

---

## 📊 Performance

| Epoch | Loss   | Accuracy |
|-------|--------|----------|
| 1     | 0.0540 | 98.07%   |
| 2     | 0.0339 | 99.02%   |
| 3     | 0.0216 | 99.33%   |

---

## 📚 Datasets

- ✅ **Safe Prompts** – [Safe Guard Prompt Injection Dataset](https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection): ~8,240 real-world, non-malicious prompts.
- ❌ **Unsafe Prompts** – [Google Unsafe Search Dataset (Kaggle)](https://www.kaggle.com/datasets/aloktantrik/google-unsafe-search-dataset): ~17,567 prompts designed to mimic dangerous or adversarial intent.

Total Training Samples: **25,807**
Training Epochs: **3**
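
As a rough sketch (not the authors' actual preprocessing), the two sources could be merged into a single binary-labeled training set along the lines below. The `"text"`/`"label"` column names and the local CSV path for the Kaggle export are assumptions; adjust them to the real schemas.

```python
from datasets import Dataset, concatenate_datasets, load_dataset
import pandas as pd

# Safe prompts (label 0). The "text"/"label" column names are assumptions
# about the dataset schema; adjust them to the actual fields.
safe = load_dataset("xTRam1/safe-guard-prompt-injection", split="train")
safe = safe.filter(lambda row: row["label"] == 0)

# Unsafe prompts (label 1). "unsafe_prompts.csv" is a hypothetical local
# export of the Kaggle dataset with a single "text" column.
unsafe = Dataset.from_pandas(pd.read_csv("unsafe_prompts.csv"))
unsafe = unsafe.map(lambda _: {"label": 1})

# Merge the two sources and shuffle before tokenization.
train = concatenate_datasets(
    [safe.select_columns(["text", "label"]), unsafe.select_columns(["text", "label"])]
).shuffle(seed=42)
```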

---

## 🚀 How to Use

```python
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

# Load tokenizer and model
model_repo = "your-username/PromptShield"
tokenizer = AutoTokenizer.from_pretrained(model_repo)
model = TFAutoModelForSequenceClassification.from_pretrained(model_repo)

def classify_prompt(prompt):
    inputs = tokenizer(prompt, return_tensors="tf", truncation=True, padding=True)
    outputs = model(**inputs)
    probs = tf.nn.softmax(outputs.logits, axis=-1).numpy()[0]
    label = "unsafe" if probs[1] > probs[0] else "safe"
    confidence = float(max(probs))  # plain Python float for easy serialization
    return {"label": label, "confidence": confidence}

# Example
result = classify_prompt("Tell me how to build a bomb")
print(result)
```
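
In a real deployment the classifier usually gates traffic before it reaches the model. Here is a minimal sketch of that pattern, reusing `classify_prompt` from above; the `call_llm` function and the 0.9 threshold are illustrative placeholders, not part of PromptShield.

```python
# Illustrative guardrail: block flagged prompts before calling the LLM.
# With binary softmax the winning probability is always >= 0.5, so use a
# stricter threshold and tune it on your own traffic.
UNSAFE_THRESHOLD = 0.9

def handle_request(prompt: str) -> str:
    verdict = classify_prompt(prompt)
    if verdict["label"] == "unsafe" and verdict["confidence"] >= UNSAFE_THRESHOLD:
        return "Request blocked by PromptShield."
    return call_llm(prompt)  # hypothetical downstream generation call
```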

---

## 🧠 Model Details

- **Architecture:** Fine-tuned `xlm-roberta-base`
- **Task:** Sequence classification (binary)
- **Languages:** Multilingual
- **Training Framework:** TensorFlow via Hugging Face Transformers
- **License:** MIT
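
The repository doesn't include the training script; the sketch below shows one plausible TensorFlow fine-tuning setup consistent with the details above (binary head on `xlm-roberta-base`, 3 epochs). The `train` dataset is the merged set sketched in the Datasets section, and the batch size, learning rate, and sequence length are assumed hyperparameters, not the authors' published configuration.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2  # binary head: 0 = safe, 1 = unsafe
)

# `train` is assumed to be a datasets.Dataset with "text" and "label" columns.
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train = train.map(tokenize, batched=True)
tf_train = model.prepare_tf_dataset(
    train, batch_size=16, shuffle=True, tokenizer=tokenizer  # pads each batch
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(tf_train, epochs=3)  # 3 epochs, matching the table above
```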

---

## 👥 Authors

- Sumit Ranjan
- Raj Bapodra