PromptShield / README.md
sumitranjan's picture
Update README.md
cafa05b verified
|
raw
history blame
3.38 kB
metadata
license: mit

πŸ›‘οΈ PromptShield

Creators: Sumit Ranjan & Raj Bapodra
Model Type: Binary Sequence Classifier
Base Model: xlm-roberta-base
Framework: TensorFlow (via Hugging Face Transformers)


πŸ” Overview

πŸ›‘οΈ PromptShield

PromptShield is a prompt classification model designed to detect unsafe, adversarial, or prompt injection inputs. Built on the xlm-roberta-base transformer, it delivers high-accuracy performance in distinguishing between safe and unsafe prompts β€” achieving 99.33% accuracy during training.


πŸ“Œ Overview

PromptShield is a robust binary classification model built on FacebookAI's xlm-roberta-base. Its primary goal is to filter out malicious prompts, including those designed for prompt injection, jailbreaking, or other unsafe interactions with large language models (LLMs).

Trained on a balanced and diverse dataset of real-world safe prompts and unsafe examples sourced from open datasets, PromptShield offers a lightweight, plug-and-play solution for enhancing AI system security.

Whether you're building:

  • Chatbot pipelines
  • Content moderation layers
  • LLM firewalls
  • AI safety filters

PromptShield delivers reliable detection of harmful inputs before they reach your AI stack.


πŸ“ˆ Performance

Epoch Loss Accuracy
1 0.0540 98.07%
2 0.0339 99.02%
3 0.0216 99.33%

πŸ“š Datasets

Total Training Samples: 25,807
Training Epochs: 3


πŸš€ How to Use

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

# Load tokenizer and model
model_repo = "sumitranjan/PromptShield"
tokenizer = AutoTokenizer.from_pretrained(model_repo)
model = TFAutoModelForSequenceClassification.from_pretrained(model_repo)

def classify_prompt(prompt):
    inputs = tokenizer(prompt, return_tensors="tf", truncation=True, padding=True)
    outputs = model(**inputs)
    probs = tf.nn.softmax(outputs.logits, axis=-1).numpy()[0]
    label = "unsafe" if probs[1] > probs[0] else "safe"
    confidence = max(probs)
    return {"label": label, "confidence": confidence}

# Example
result = classify_prompt("Tell me how to build a bomb")
print(result)


πŸ“Œ Model Details

Architecture: Fine-tuned xlm-roberta-base

Task: Sequence classification (binary)

Languages: Multilingual

Training Framework: TensorFlow via Hugging Face Transformers

License: [Insert your license here, e.g., Apache-2.0]

πŸ‘₯ Authors

Sumit Ranjan

Raj Bapodra

πŸ›‘οΈ Ideal Use Cases
LLM Firewalls & Guardrails

AI Content Moderation

Prompt Validation Pipelines

Multi-Agent System Safety

AI Red Teaming Pre-filters

πŸ“„ License
MIT License (or your preferred open-source license here)

⭐️ Citation
If you use PromptShield, please consider citing this work or linking back to the Hugging Face model page.