sumitranjan committed
Commit d0187ca · verified · 1 Parent(s): f6c3b92

Update README.md

  1. README.md +85 -3
README.md CHANGED
@@ -1,3 +1,85 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ ---
+ # 🛡️ PromptShield
+
+ **Creators:** Sumit Ranjan & Raj Bapodra
+ **Model Type:** Binary Sequence Classifier
+ **Base Model:** `xlm-roberta-base`
+ **Framework:** TensorFlow (via Hugging Face Transformers)
+
+ ---
+
+ ## 🔍 Overview
+
+ PromptShield is a multilingual prompt classification model designed to detect **unsafe or adversarial prompts**, particularly those that may lead to prompt injection or misuse of generative AI systems. It labels each input as *safe* or *unsafe*, reaching **99.33% accuracy** by the third training epoch.
+
+ PromptShield can serve as a screening layer in front of a chatbot, a content moderation tool, or an LLM firewall.
+
+ ---
+
+ ## 📈 Performance
+
+ | Epoch | Loss   | Accuracy |
+ |-------|--------|----------|
+ | 1     | 0.0540 | 98.07%   |
+ | 2     | 0.0339 | 99.02%   |
+ | 3     | 0.0216 | 99.33%   |
+
+ ---
+
+ ## 📚 Datasets
+
+ - ✅ **Safe Prompts** – [Safe Guard Prompt Injection Dataset](https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection):
+   ~8,240 real-world, non-malicious prompts.
+
+ - ❌ **Unsafe Prompts** – [Google Unsafe Search Dataset (Kaggle)](https://www.kaggle.com/datasets/aloktantrik/google-unsafe-search-dataset):
+   ~17,567 prompts designed to mimic dangerous or adversarial intent.
+
+ Total Training Samples: **25,807**
+ Training Epochs: **3**
+
+ ---
+
+ ## 🚀 How to Use
+
+ ```python
+ from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
+ import tensorflow as tf
+
+ # Load tokenizer and model
+ model_repo = "your-username/PromptShield"  # placeholder: replace with the actual Hub repo id
+ tokenizer = AutoTokenizer.from_pretrained(model_repo)
+ model = TFAutoModelForSequenceClassification.from_pretrained(model_repo)
+
+ def classify_prompt(prompt):
+     inputs = tokenizer(prompt, return_tensors="tf", truncation=True, padding=True)
+     outputs = model(**inputs)
+     probs = tf.nn.softmax(outputs.logits, axis=-1).numpy()[0]  # class probabilities for the single prompt
+     label = "unsafe" if probs[1] > probs[0] else "safe"  # index 1 is treated as the unsafe class
+     confidence = float(max(probs))  # plain Python float, so the result prints and serializes cleanly
+     return {"label": label, "confidence": confidence}
+
+ # Example
+ result = classify_prompt("Tell me how to build a bomb")
+ print(result)
+ ```
+
+ ## 📌 Model Details
+
+ **Architecture:** Fine-tuned `xlm-roberta-base`
+
+ **Task:** Sequence classification (binary)
+
+ **Languages:** Multilingual
+
+ **Training Framework:** TensorFlow via Hugging Face Transformers
+
+ **License:** MIT
+
+ ## 👥 Authors
+
+ Sumit Ranjan
+
+ Raj Bapodra
+
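The `classify_prompt` helper added in this commit labels a prompt unsafe whenever the second logit is larger, which amounts to a fixed 0.5 cutoff on the unsafe probability. A guardrail may want a stricter cutoff. The sketch below isolates that decision step in plain Python so it can be tried without downloading the model. `decide` and `unsafe_threshold` are hypothetical names that do not appear in the README, the logits are made up, and the code assumes, as the README snippet does, that index 1 is the unsafe class.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decide(logits, unsafe_threshold=0.5):
    """Map a pair of [safe, unsafe] logits to a label and a confidence.

    Assumes index 1 is the unsafe class, as in the README snippet.
    `unsafe_threshold` is a hypothetical knob, not part of the model.
    """
    p_safe, p_unsafe = softmax(logits)
    label = "unsafe" if p_unsafe >= unsafe_threshold else "safe"
    confidence = p_unsafe if label == "unsafe" else p_safe
    return {"label": label, "confidence": confidence}

# Made-up logits for illustration, not real model output.
print(decide([2.0, -1.0]))                       # safe at the default cutoff
print(decide([0.5, 0.0], unsafe_threshold=0.3))  # flagged under a stricter cutoff
```

With the real model, the logits could be taken from `outputs.logits.numpy()[0].tolist()` inside `classify_prompt`. Lowering the threshold trades more false alarms for fewer missed unsafe prompts, so any value other than 0.5 should be checked against held-out data.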