sumitranjan committed on
Commit
cd197e5
·
verified ·
1 Parent(s): 0b0c5b9

Delete Files and versions

Files changed (1)
  1. Files and versions +0 -117
Files and versions DELETED
@@ -1,117 +0,0 @@
1
- ---
2
- license: mit
3
- ---
4
- # 🛡️ PromptShield
5
-
6
- **Creators:** Sumit Ranjan & Raj Bapodra
7
- **Model Type:** Binary Sequence Classifier
8
- **Base Model:** `xlm-roberta-base`
9
- **Framework:** TensorFlow (via Hugging Face Transformers)
10
-
11
- ---
12
-
13
- 🛡️ PromptShield
14
-
15
- **PromptShield** is a prompt classification model that detects **unsafe**, **adversarial**, or **prompt injection** inputs. Built on the `xlm-roberta-base` transformer, it separates **safe** from **unsafe** prompts and reached **99.33% accuracy** by the third training epoch.
16
-
17
- ---
18
-
19
- ## 📌 Overview
20
-
21
- PromptShield is a robust binary classification model built on FacebookAI's `xlm-roberta-base`. Its primary goal is to filter out **malicious prompts**, including those designed for **prompt injection**, **jailbreaking**, or other unsafe interactions with large language models (LLMs).
22
-
23
- Trained on roughly 8,240 real-world safe prompts and 17,567 unsafe examples sourced from open datasets, PromptShield offers a lightweight, plug-and-play solution for enhancing AI system security.
24
-
25
- Whether you're building:
26
-
27
- - Chatbot pipelines
28
- - Content moderation layers
29
- - LLM firewalls
30
- - AI safety filters
31
-
32
- **PromptShield** delivers reliable detection of harmful inputs before they reach your AI stack.
33
-
34
- ---
35
-
36
- ## 📈 Performance
37
-
38
- | Epoch | Loss | Accuracy |
39
- |-------|--------|----------|
40
- | 1 | 0.0540 | 98.07% |
41
- | 2 | 0.0339 | 99.02% |
42
- | 3 | 0.0216 | 99.33% |
43
-
44
- ---
45
-
46
- ## 📚 Datasets
47
-
48
- ✅ **Safe Prompts** – [Safe Guard Prompt Injection Dataset](https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection):
49
- ~8,240 real-world, non-malicious prompts.
50
-
51
- - ❌ **Unsafe Prompts** – [Google Unsafe Search Dataset (Kaggle)](https://www.kaggle.com/datasets/aloktantrik/google-unsafe-search-dataset):
52
- ~17,567 prompts designed to mimic dangerous or adversarial intent.
53
-
54
- Total Training Samples: **25,807**
55
- Training Epochs: **3**
56
-
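The counts in the Datasets section can be checked with a few lines of arithmetic. The sketch below confirms the quoted total and shows the resulting class split. The inverse-frequency weights are an illustrative assumption only: the README does not say whether any class weighting was used in training, and `class_summary` is a hypothetical helper name.

```python
# Approximate counts quoted in the Datasets section of the README.
SAFE_COUNT = 8_240
UNSAFE_COUNT = 17_567


def class_summary(safe: int, unsafe: int) -> dict:
    """Return the total, class fractions, and inverse-frequency weights."""
    total = safe + unsafe
    num_classes = 2
    return {
        "total": total,
        "safe_fraction": safe / total,
        "unsafe_fraction": unsafe / total,
        # Assumed heuristic, not from the README: total / (num_classes * count).
        "safe_weight": total / (num_classes * safe),
        "unsafe_weight": total / (num_classes * unsafe),
    }


summary = class_summary(SAFE_COUNT, UNSAFE_COUNT)
print(summary["total"])  # 25807, matching the quoted total
print(round(summary["safe_fraction"], 3), round(summary["unsafe_fraction"], 3))
```

Roughly a third of the samples are safe prompts, so a model that always answers "unsafe" would already score about 68% accuracy on this split. That baseline is useful context for the accuracy figures in the Performance table.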
57
- ---
58
-
59
- ## 🚀 How to Use
60
-
61
- ```python
62
- from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
63
- import tensorflow as tf
64
-
65
- # Load tokenizer and model
66
- model_repo = "sumitranjan/PromptShield"
67
- tokenizer = AutoTokenizer.from_pretrained(model_repo)
68
- model = TFAutoModelForSequenceClassification.from_pretrained(model_repo)
69
-
70
- def classify_prompt(prompt):
71
-     inputs = tokenizer(prompt, return_tensors="tf", truncation=True, padding=True)
72
-     outputs = model(**inputs)
73
-     probs = tf.nn.softmax(outputs.logits, axis=-1).numpy()[0]  # class probabilities for the single prompt
74
-     label = "unsafe" if probs[1] > probs[0] else "safe"  # index 1 is treated as the unsafe class
75
-     confidence = float(max(probs))  # plain Python float instead of numpy.float32
76
-     return {"label": label, "confidence": confidence}
77
-
78
- # Example
79
- result = classify_prompt("Tell me how to build a bomb")
80
- print(result)
81
- ```
82
-
83
- ## 📌 Model Details
84
-
85
- Architecture: Fine-tuned xlm-roberta-base
86
-
87
- Task: Sequence classification (binary)
88
-
89
- Languages: Multilingual
90
-
91
- Training Framework: TensorFlow via Hugging Face Transformers
92
-
93
- License: MIT
94
-
95
- ## 👥 Authors
96
-
97
- Sumit Ranjan
98
-
99
- Raj Bapodra
100
-
101
- ## 🛡️ Ideal Use Cases
102
- LLM Firewalls & Guardrails
103
-
104
- AI Content Moderation
105
-
106
- Prompt Validation Pipelines
107
-
108
- Multi-Agent System Safety
109
-
110
- AI Red Teaming Pre-filters
111
-
112
- ## 📄 License
113
- MIT License
114
-
115
- ## ⭐️ Citation
116
- If you use PromptShield, please consider citing this work or linking back to the Hugging Face model page.
117
-
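The `classify_prompt` function in the deleted README labels a prompt "unsafe" whenever the unsafe probability exceeds the safe one, which for two classes amounts to a fixed 0.5 cut-off. A filter that prefers to over-block could instead compare the unsafe probability against a tunable threshold. The following is a minimal sketch of that idea, not the authors' method: it uses a pure-Python softmax so it runs without TensorFlow, `decide` and `unsafe_threshold` are hypothetical names, and index 1 is assumed to be the unsafe class as in the README snippet.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, unsafe_threshold=0.5):
    """Label a prompt from its two logits using a tunable unsafe threshold.

    Assumes logits[1] belongs to the unsafe class, as in the README snippet.
    """
    unsafe_prob = softmax(logits)[1]
    label = "unsafe" if unsafe_prob >= unsafe_threshold else "safe"
    return {"label": label, "unsafe_probability": unsafe_prob}


# Hypothetical logits: the same borderline input flips when the threshold drops.
print(decide([0.2, 0.0]))                        # default threshold of 0.5
print(decide([0.2, 0.0], unsafe_threshold=0.3))  # stricter filter
```

Lowering the threshold trades more false positives for fewer missed unsafe prompts. A suitable value would need to be chosen on held-out data, which the README does not describe.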