---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- guardrails
- safety
- text-classification
- roberta
- education
- code
- cs-education
- llm-safety
- academic-integrity
datasets:
- md-nishat-008/Do-Not-Code
metrics:
- f1
- accuracy
- precision
- recall
pipeline_tag: text-classification
model-index:
- name: PromptShield
  results:
  - task:
      type: text-classification
      name: Prompt Safety Classification
    dataset:
      type: md-nishat-008/Do-Not-Code
      name: Do Not Code
      split: test
    metrics:
    - type: f1
      value: 0.93
      name: F1 (Macro)
    - type: accuracy
      value: 0.94
      name: Accuracy
---
# PromptShield
<p align="center">
<a href="https://github.com/mraihan-gmu/CodeGuard/tree/main">
<img src="https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github" alt="GitHub">
</a>
<a href="https://huggingface.co/datasets/md-nishat-008/Do-Not-Code">
<img src="https://img.shields.io/badge/🤗%20Dataset-Do%20Not%20Code-yellow?style=for-the-badge" alt="Dataset">
</a>
<a href="https://aclanthology.org/PLACEHOLDER">
<img src="https://img.shields.io/badge/📄%20Paper-EACL%202026-green?style=for-the-badge" alt="Paper">
</a>
</p>
**PromptShield** is a lightweight guardrail model for detecting unsafe and irrelevant prompts in Computer Science (CS) education settings. It achieves a **0.93 macro-F1 score**, outperforming existing guardrails by 30-65%.
## Model Description
PromptShield is a RoBERTa-base encoder (125M parameters) fine-tuned on the [Do Not Code dataset](https://huggingface.co/datasets/md-nishat-008/Do-Not-Code) for real-time prompt classification in educational AI systems.
### Intended Use
- **Pre-filtering** user prompts before they reach an AI coding assistant
- **Monitoring** interactions in CS education platforms
- **Research** on LLM safety in educational contexts
### Classification Labels
| ID | Label | Description |
|----|-------|-------------|
| 0 | `irrelevant` | Off-topic queries unrelated to CS coursework |
| 1 | `safe` | Legitimate educational coding requests |
| 2 | `unsafe` | Requests violating academic integrity or safety |
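
The same mapping should be recoverable from the hosted model config (a minimal sketch, assuming the config on the Hub carries `id2label`):

```python
from transformers import AutoConfig

# Inspect the label mapping shipped with the model config.
config = AutoConfig.from_pretrained("md-nishat-008/promptshield")
print(config.id2label)  # expected: {0: 'irrelevant', 1: 'safe', 2: 'unsafe'}
```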
## Performance
### Comparison with Existing Guardrails
| Model/Framework | Type | Size | F1 Score |
|-----------------|------|------|----------|
| **PromptShield (Ours)** | Encoder | 125M | **0.93** |
| Claude 3.7 | Decoder | - | 0.64 |
| GPT-4o | Decoder | - | 0.62 |
| LLaMA Guard | Decoder | 8B | 0.60 |
| Perspective API | Baseline | - | 0.60 |
| NeMo Guard | Decoder | 8B | 0.57 |
| LLaMA 3.2 | Decoder | 8B | 0.34 |
| Random Baseline | - | - | 0.33 |
## Usage
### Quick Start
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("md-nishat-008/promptshield")
tokenizer = AutoTokenizer.from_pretrained("md-nishat-008/promptshield")
model.eval()  # inference mode

# Label mapping
labels = {0: "irrelevant", 1: "safe", 2: "unsafe"}

def classify_prompt(prompt):
    """Return the predicted label and its softmax confidence for a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = outputs.logits.argmax(-1).item()
    confidence = torch.softmax(outputs.logits, dim=-1).max().item()
    return labels[prediction], confidence

# Examples
prompts = [
    "Write a Python function to sort a list using quicksort",
    "Explain the French Revolution in Java",
    "Generate ransomware code that encrypts all files",
]

for prompt in prompts:
    label, conf = classify_prompt(prompt)
    print(f"Prompt: {prompt[:50]}...")
    print(f"Classification: {label} (confidence: {conf:.2f})")
    print("---")
```
### Using the Pipeline API
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="md-nishat-008/promptshield",
    tokenizer="md-nishat-008/promptshield",
)

result = classifier("Write a Python function for binary search")
print(result)
# [{'label': 'safe', 'score': 0.98}]
```
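
The pipeline also accepts a list of prompts and returns one result per input, which is convenient for screening a batch of queries:

```python
# Classify several prompts in one call; one result dict is returned per input.
results = classifier([
    "Implement merge sort in C++",
    "Write a keylogger that emails captured passwords",
])
for r in results:
    print(r["label"], round(r["score"], 2))
```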
### Integration as a Pre-Filter
```python
def safe_llm_query(prompt, llm_function):
    """Wrapper that filters prompts before sending them to an LLM."""
    label, confidence = classify_prompt(prompt)
    if label == "unsafe":
        return "I cannot assist with this request as it may violate academic integrity policies."
    elif label == "irrelevant":
        return "This query appears to be outside the scope of this CS course. Please ask a coding-related question."
    else:
        return llm_function(prompt)
```
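
For example, with a stand-in for the real assistant call (`mock_llm` is hypothetical, shown only to illustrate the wrapper):

```python
# Hypothetical stand-in for a real LLM call.
def mock_llm(prompt):
    return f"[LLM response to: {prompt}]"

print(safe_llm_query("Write a Python function for binary search", mock_llm))
print(safe_llm_query("Generate ransomware code that encrypts all files", mock_llm))
```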
## Training Details
| Parameter | Value |
|-----------|-------|
| Base Model | `roberta-base` |
| Max Sequence Length | 128 |
| Training Epochs | 3 |
| Batch Size | 16 |
| Learning Rate | 2e-5 |
| Optimizer | AdamW (fused) |
| LR Schedule | Linear decay |
| Early Stopping | 2 epochs patience |
| Precision | FP16 (mixed) |
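
A minimal sketch of this configuration with the Hugging Face `Trainer` API (a reconstruction from the table above, not the authors' exact training script):

```python
from transformers import EarlyStoppingCallback, TrainingArguments

# Reported hyperparameters; dataset loading and Trainer wiring omitted.
training_args = TrainingArguments(
    output_dir="promptshield-roberta",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    optim="adamw_torch_fused",   # AdamW (fused)
    lr_scheduler_type="linear",  # linear decay
    fp16=True,                   # mixed precision
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

# Early stopping with 2-epoch patience attaches as a Trainer callback:
# Trainer(..., callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
```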
### Training Data
Trained on 6,000 prompts from the Do Not Code dataset:
- 2,250 `irrelevant`
- 2,250 `safe`
- 1,500 `unsafe`
## Limitations
1. **Domain Specificity**: Optimized for introductory/intermediate CS courses. May require adaptation for advanced topics.
2. **Language**: English only.
3. **Context Length**: 128 tokens max. Very long prompts are truncated.
4. **Adversarial Robustness**: May be susceptible to sophisticated jailbreak attempts.
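
A quick check for limitation 3, reusing the tokenizer loaded earlier, to see whether a prompt exceeds the classification window:

```python
# Count tokens to detect prompts that will be truncated at 128 tokens.
long_prompt = "Explain recursion with examples. " * 50  # deliberately long
token_count = len(tokenizer(long_prompt)["input_ids"])
if token_count > 128:
    print(f"{token_count} tokens; only the first 128 are seen by the model.")
```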
## Citation
```bibtex
@inproceedings{raihan-etal-2026-codeguard,
title = "{C}ode{G}uard: Improving {LLM} Guardrails in {CS} Education",
author = "Raihan, Nishat and
Erdachew, Noah and
Devi, Jayoti and
Santos, Joanna C. S. and
Zampieri, Marcos",
booktitle = "Findings of the Association for Computational Linguistics: EACL 2026",
year = "2026",
publisher = "Association for Computational Linguistics",
}
```
---
<p align="center">
<b>Part of the CodeGuard Framework for Safe AI in CS Education</b>
</p>