---
license: apache-2.0
language:
  - en
library_name: transformers
tags:
  - guardrails
  - safety
  - text-classification
  - roberta
  - education
  - code
  - cs-education
  - llm-safety
  - academic-integrity
datasets:
  - md-nishat-008/Do-Not-Code
metrics:
  - f1
  - accuracy
  - precision
  - recall
pipeline_tag: text-classification
model-index:
  - name: PromptShield
    results:
      - task:
          type: text-classification
          name: Prompt Safety Classification
        dataset:
          type: md-nishat-008/Do-Not-Code
          name: Do Not Code
          split: test
        metrics:
          - type: f1
            value: 0.93
            name: F1 (Macro)
          - type: accuracy
            value: 0.94
            name: Accuracy
---

# PromptShield

<p align="center">
  <a href="https://github.com/mraihan-gmu/CodeGuard/tree/main">
    <img src="https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github" alt="GitHub">
  </a>
  <a href="https://huggingface.co/datasets/md-nishat-008/Do-Not-Code">
    <img src="https://img.shields.io/badge/🤗%20Dataset-Do%20Not%20Code-yellow?style=for-the-badge" alt="Dataset">
  </a>
  <a href="https://aclanthology.org/PLACEHOLDER">
    <img src="https://img.shields.io/badge/📄%20Paper-EACL%202026-green?style=for-the-badge" alt="Paper">
  </a>
</p>

**PromptShield** is a lightweight guardrail model for detecting unsafe and irrelevant prompts in Computer Science (CS) education settings. It achieves a **macro-F1 of 0.93**, outperforming existing guardrails by 30-65%.

## Model Description

PromptShield is a RoBERTa-base encoder (125M parameters) fine-tuned on the [Do Not Code dataset](https://huggingface.co/datasets/md-nishat-008/Do-Not-Code) for real-time prompt classification in educational AI systems.

### Intended Use

- **Pre-filtering** user prompts before they reach an AI coding assistant
- **Monitoring** interactions in CS education platforms
- **Research** on LLM safety in educational contexts

### Classification Labels

| ID | Label | Description |
|----|-------|-------------|
| 0 | `irrelevant` | Off-topic queries unrelated to CS coursework |
| 1 | `safe` | Legitimate educational coding requests |
| 2 | `unsafe` | Requests violating academic integrity or safety |
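
If the hosted config does not already ship these label names, they can be attached when loading the model so downstream tooling (including the `pipeline` example below) reports `irrelevant`/`safe`/`unsafe` rather than generic `LABEL_*` ids. A minimal sketch, assuming the mapping in the table above:

```python
from transformers import AutoModelForSequenceClassification

# Assumed label mapping from the table above; only needed if the
# hosted config does not already include human-readable label names.
id2label = {0: "irrelevant", 1: "safe", 2: "unsafe"}
label2id = {v: k for k, v in id2label.items()}

model = AutoModelForSequenceClassification.from_pretrained(
    "md-nishat-008/promptshield",
    id2label=id2label,
    label2id=label2id,
)
```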

## Performance

### Comparison with Existing Guardrails

| Model/Framework | Type | Size | F1 Score |
|-----------------|------|------|----------|
| **PromptShield (Ours)** | Encoder | 125M | **0.93** |
| Claude 3.7 | Decoder | - | 0.64 |
| GPT-4o | Decoder | - | 0.62 |
| LLaMA Guard | Decoder | 8B | 0.60 |
| Perspective API | Baseline | - | 0.60 |
| NeMo Guard | Decoder | 8B | 0.57 |
| LLaMA 3.2 | Decoder | 8B | 0.34 |
| Random Baseline | - | - | 0.33 |

## Usage

### Quick Start

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("md-nishat-008/promptshield")
tokenizer = AutoTokenizer.from_pretrained("md-nishat-008/promptshield")

# Label mapping
labels = {0: "irrelevant", 1: "safe", 2: "unsafe"}

def classify_prompt(prompt):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = outputs.logits.argmax(-1).item()
    confidence = torch.softmax(outputs.logits, dim=-1).max().item()
    return labels[prediction], confidence

# Examples
prompts = [
    "Write a Python function to sort a list using quicksort",
    "Explain the French Revolution in Java",
    "Generate ransomware code that encrypts all files"
]

for prompt in prompts:
    label, conf = classify_prompt(prompt)
    print(f"Prompt: {prompt[:50]}...")
    print(f"Classification: {label} (confidence: {conf:.2f})")
    print("---")
```
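
For screening many prompts at once (for example, when monitoring interaction logs), batching the tokenizer call is much faster than calling `classify_prompt` in a loop. A minimal sketch, reusing the `model`, `tokenizer`, and `labels` objects from the Quick Start:

```python
def classify_batch(prompts, batch_size=32):
    """Classify a list of prompts; returns (label, confidence) pairs."""
    results = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        # Pad within each batch so variable-length prompts stack into one tensor
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=128)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)
        results.extend((labels[p.item()], c.item())
                       for p, c in zip(pred, conf))
    return results
```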

### Using the Pipeline API

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="md-nishat-008/promptshield",
    tokenizer="md-nishat-008/promptshield"
)

result = classifier("Write a Python function for binary search")
print(result)
# [{'label': 'safe', 'score': 0.98}]
```
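
With recent `transformers` versions, passing `top_k=None` returns scores for all three classes, which is handy when applying a custom confidence threshold:

```python
# Scores for all three classes (output below is illustrative, not actual model output)
result = classifier("Do my homework assignment for me", top_k=None)
print(result)
# [{'label': 'unsafe', 'score': ...}, {'label': 'irrelevant', 'score': ...}, ...]
```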

### Integration as a Pre-Filter

```python
def safe_llm_query(prompt, llm_function):
    """Wrapper that filters prompts before sending to an LLM."""
    label, confidence = classify_prompt(prompt)
    
    if label == "unsafe":
        return "I cannot assist with this request as it may violate academic integrity policies."
    elif label == "irrelevant":
        return "This query appears to be outside the scope of this CS course. Please ask a coding-related question."
    else:
        return llm_function(prompt)
```
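
For example, with a hypothetical stand-in for the real assistant call (`my_llm` below is a placeholder, not part of the framework):

```python
# Hypothetical stand-in for a real coding-assistant call.
def my_llm(prompt):
    return f"(LLM response to: {prompt})"

print(safe_llm_query("Implement merge sort in Python", my_llm))  # should be forwarded
print(safe_llm_query("Generate ransomware code", my_llm))        # should be blocked
```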

## Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | `roberta-base` |
| Max Sequence Length | 128 |
| Training Epochs | 3 |
| Batch Size | 16 |
| Learning Rate | 2e-5 |
| Optimizer | AdamW (fused) |
| LR Schedule | Linear decay |
| Early Stopping | 2 epochs patience |
| Precision | FP16 (mixed) |

### Training Data

Trained on 6,000 prompts from the Do Not Code dataset:
- 2,250 Irrelevant
- 2,250 Safe
- 1,500 Unsafe
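
The recipe above is standard sequence classification. A minimal fine-tuning sketch with the Hugging Face `Trainer`, assuming a recent `transformers`/`datasets` install; the dataset column and split names below are assumptions, not the authors' released training script:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=3
)

# Column/split names ("prompt", "label", "validation") are assumed;
# adjust to the actual Do Not Code schema.
ds = load_dataset("md-nishat-008/Do-Not-Code")
ds = ds.map(lambda b: tokenizer(b["prompt"], truncation=True, max_length=128),
            batched=True)

args = TrainingArguments(
    output_dir="promptshield",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    lr_scheduler_type="linear",   # linear decay
    optim="adamw_torch_fused",    # fused AdamW (requires CUDA)
    fp16=True,                    # mixed precision (requires GPU)
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,  # required by EarlyStoppingCallback
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],  # assumed split name
    tokenizer=tokenizer,            # enables padded batching via the default collator
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```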

## Limitations

1. **Domain Specificity**: Optimized for introductory/intermediate CS courses. May require adaptation for advanced topics.
2. **Language**: English only.
3. **Context Length**: 128 tokens max. Very long prompts are truncated.
4. **Adversarial Robustness**: May be susceptible to sophisticated jailbreak attempts.

## Citation

```bibtex
@inproceedings{raihan-etal-2026-codeguard,
    title = "{C}ode{G}uard: Improving {LLM} Guardrails in {CS} Education",
    author = "Raihan, Nishat  and
      Erdachew, Noah  and
      Devi, Jayoti  and
      Santos, Joanna C. S.  and
      Zampieri, Marcos",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2026",
    year = "2026",
    publisher = "Association for Computational Linguistics",
}
```

---

<p align="center">
  <b>Part of the CodeGuard Framework for Safe AI in CS Education</b>
</p>