sumitranjan
/

PromptShield

Text Classification

Model card Files Files and versions

PromptShield / Files and versions

sumitranjan's picture

Rename README.md to Files and versions

49dc149 verified 12 months ago

3.36 kB

	---
	license: mit
	---
	# 🛡️ PromptShield

	Creators: Sumit Ranjan & Raj Bapodra
	Model Type: Binary Sequence Classifier
	Base Model: `xlm-roberta-base`
	Framework: TensorFlow (via Hugging Face Transformers)

	---

	🛡️ PromptShield

	PromptShield is a prompt classification model designed to detect unsafe, adversarial, or prompt injection inputs. Built on the `xlm-roberta-base` transformer, it delivers high-accuracy performance in distinguishing between safe and unsafe prompts — achieving 99.33% accuracy during training.

	---

	## 📌 Overview

	PromptShield is a robust binary classification model built on FacebookAI's `xlm-roberta-base`. Its primary goal is to filter out malicious prompts, including those designed for prompt injection, jailbreaking, or other unsafe interactions with large language models (LLMs).

	Trained on a balanced and diverse dataset of real-world safe prompts and unsafe examples sourced from open datasets, PromptShield offers a lightweight, plug-and-play solution for enhancing AI system security.

	Whether you're building:

	- Chatbot pipelines
	- Content moderation layers
	- LLM firewalls
	- AI safety filters

	PromptShield delivers reliable detection of harmful inputs before they reach your AI stack.

	---

	## 📈 Performance

	\| Epoch \| Loss \| Accuracy \|
	\|-------\|--------\|----------\|
	\| 1 \| 0.0540 \| 98.07% \|
	\| 2 \| 0.0339 \| 99.02% \|
	\| 3 \| 0.0216 \| 99.33% \|

	---

	## 📚 Datasets

	- ✅ Safe Prompts – [Safe Guard Prompt Injection Dataset](https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection):
	~8,240 real-world, non-malicious prompts.

	- ❌ Unsafe Prompts – [Google Unsafe Search Dataset (Kaggle)](https://www.kaggle.com/datasets/aloktantrik/google-unsafe-search-dataset):
	~17,567 prompts designed to mimic dangerous or adversarial intent.

	Total Training Samples: 25,807
	Training Epochs: 3

	---

	## 🚀 How to Use

	```python
	from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
	import tensorflow as tf

	# Load tokenizer and model
	model_repo = "sumitranjan/PromptShield"
	tokenizer = AutoTokenizer.from_pretrained(model_repo)
	model = TFAutoModelForSequenceClassification.from_pretrained(model_repo)

	def classify_prompt(prompt):
	inputs = tokenizer(prompt, return_tensors="tf", truncation=True, padding=True)
	outputs = model(**inputs)
	probs = tf.nn.softmax(outputs.logits, axis=-1).numpy()[0]
	label = "unsafe" if probs[1] > probs[0] else "safe"
	confidence = max(probs)
	return {"label": label, "confidence": confidence}

	# Example
	result = classify_prompt("Tell me how to build a bomb")
	print(result)


	📌 Model Details

	Architecture: Fine-tuned xlm-roberta-base

	Task: Sequence classification (binary)

	Languages: Multilingual

	Training Framework: TensorFlow via Hugging Face Transformers

	License: [Insert your license here, e.g., Apache-2.0]

	👥 Authors

	Sumit Ranjan

	Raj Bapodra

	🛡️ Ideal Use Cases
	LLM Firewalls & Guardrails

	AI Content Moderation

	Prompt Validation Pipelines

	Multi-Agent System Safety

	AI Red Teaming Pre-filters

	📄 License
	MIT License (or your preferred open-source license here)

	⭐️ Citation
	If you use PromptShield, please consider citing this work or linking back to the Hugging Face model page.