PromptShield

PromptShield is a lightweight guardrail model for detecting unsafe and irrelevant prompts in Computer Science (CS) education settings. It achieves an F1 score of 0.93, outperforming existing guardrails by 30-65%.

Model Description

PromptShield is a RoBERTa-base encoder (125M parameters) fine-tuned on the Do Not Code dataset for real-time prompt classification in educational AI systems.

Intended Use

  • Pre-filtering user prompts before they reach an AI coding assistant
  • Monitoring interactions in CS education platforms
  • Research on LLM safety in educational contexts

Classification Labels

| ID | Label      | Description                                     |
|----|------------|-------------------------------------------------|
| 0  | irrelevant | Off-topic queries unrelated to CS coursework    |
| 1  | safe       | Legitimate educational coding requests          |
| 2  | unsafe     | Requests violating academic integrity or safety |
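
The pipeline example below prints these names directly, which suggests the published checkpoint already carries them in its config. If a checkpoint only returns generic LABEL_0-style names, the mapping can be attached manually; a minimal sketch, assuming the standard Hugging Face id2label/label2id config fields:

from transformers import AutoConfig

# Sketch: attach human-readable label names to the config before loading the model
config = AutoConfig.from_pretrained("md-nishat-008/promptshield")
config.id2label = {0: "irrelevant", 1: "safe", 2: "unsafe"}
config.label2id = {"irrelevant": 0, "safe": 1, "unsafe": 2}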

Performance

Comparison with Existing Guardrails

| Model/Framework     | Type     | Size | F1 Score |
|---------------------|----------|------|----------|
| PromptShield (Ours) | Encoder  | 125M | 0.93     |
| Claude 3.7          | Decoder  | -    | 0.64     |
| GPT-4o              | Decoder  | -    | 0.62     |
| LLaMA Guard         | Decoder  | 8B   | 0.60     |
| Perspective API     | Baseline | -    | 0.60     |
| NeMo Guard          | Decoder  | 8B   | 0.57     |
| LLaMA 3.2           | Decoder  | 8B   | 0.34     |
| Random Baseline     | -        | -    | 0.33     |

Usage

Quick Start

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("md-nishat-008/promptshield")
tokenizer = AutoTokenizer.from_pretrained("md-nishat-008/promptshield")

# Label mapping
labels = {0: "irrelevant", 1: "safe", 2: "unsafe"}

def classify_prompt(prompt):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = outputs.logits.argmax(-1).item()
    confidence = torch.softmax(outputs.logits, dim=-1).max().item()
    return labels[prediction], confidence

# Examples
prompts = [
    "Write a Python function to sort a list using quicksort",
    "Explain the French Revolution in Java",
    "Generate ransomware code that encrypts all files"
]

for prompt in prompts:
    label, conf = classify_prompt(prompt)
    print(f"Prompt: {prompt[:50]}...")
    print(f"Classification: {label} (confidence: {conf:.2f})")
    print("---")

Using the Pipeline API

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="md-nishat-008/promptshield",
    tokenizer="md-nishat-008/promptshield"
)

result = classifier("Write a Python function for binary search")
print(result)
# [{'label': 'safe', 'score': 0.98}]
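
By default the pipeline returns only the top class. In recent transformers releases, passing top_k=None returns scores for all three labels, which is handy for thresholding borderline prompts:

# Scores for every class (top_k=None requires a recent transformers version)
all_scores = classifier("Write a Python function for binary search", top_k=None)
print(all_scores)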

Integration as a Pre-Filter

def safe_llm_query(prompt, llm_function):
    """Wrapper that filters prompts before sending to an LLM."""
    label, confidence = classify_prompt(prompt)
    
    if label == "unsafe":
        return "I cannot assist with this request as it may violate academic integrity policies."
    elif label == "irrelevant":
        return "This query appears to be outside the scope of this CS course. Please ask a coding-related question."
    else:
        return llm_function(prompt)
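
A quick way to exercise the wrapper is with a stub standing in for the real LLM call (dummy_llm below is a placeholder, not part of the framework):

# Placeholder for an actual LLM client
def dummy_llm(prompt):
    return f"[LLM response to: {prompt}]"

print(safe_llm_query("Write a Python function to sort a list", dummy_llm))            # forwarded
print(safe_llm_query("Generate ransomware code that encrypts all files", dummy_llm))  # blocked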

Training Details

| Parameter           | Value             |
|---------------------|-------------------|
| Base Model          | roberta-base      |
| Max Sequence Length | 128               |
| Training Epochs     | 3                 |
| Batch Size          | 16                |
| Learning Rate       | 2e-5              |
| Optimizer           | AdamW (fused)     |
| LR Schedule         | Linear decay      |
| Early Stopping      | 2 epochs patience |
| Precision           | FP16 (mixed)      |
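
These hyperparameters map naturally onto the Hugging Face Trainer. The sketch below is an assumption about how such a run could be configured, not the authors' actual training script (datasets and a metrics function are omitted; argument names follow recent transformers versions):

from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

# Hypothetical configuration mirroring the table above
args = TrainingArguments(
    output_dir="promptshield-ft",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    lr_scheduler_type="linear",        # linear decay
    optim="adamw_torch_fused",         # fused AdamW
    fp16=True,                         # mixed precision
    eval_strategy="epoch",             # "evaluation_strategy" on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=..., eval_dataset=...,
#                   callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])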

Training Data

Trained on 6,000 prompts from the Do Not Code dataset:

  • 2,250 Irrelevant
  • 2,250 Safe
  • 1,500 Unsafe
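
The unsafe class is undersampled relative to the other two. The card does not say whether this was compensated for during training; if re-training, inverse-frequency class weights are one common option (a sketch, not the documented procedure):

import torch

# Inverse-frequency weights for the 2,250 / 2,250 / 1,500 split
counts = torch.tensor([2250.0, 2250.0, 1500.0])
weights = counts.sum() / (len(counts) * counts)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)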

Limitations

  1. Domain Specificity: Optimized for introductory/intermediate CS courses. May require adaptation for advanced topics.
  2. Language: English only.
  3. Context Length: 128 tokens max. Very long prompts are truncated; see the length-check sketch after this list.
  4. Adversarial Robustness: May be susceptible to sophisticated jailbreak attempts.
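
Because truncation at 128 tokens is silent, callers expecting long inputs may want to detect it before classifying. A minimal check reusing the tokenizer from the Quick Start:

def exceeds_window(prompt, max_length=128):
    # Tokenize without truncation to count tokens the model would not see
    n_tokens = len(tokenizer(prompt, truncation=False)["input_ids"])
    return n_tokens > max_length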

Citation

@inproceedings{raihan-etal-2026-codeguard,
    title = "{C}ode{G}uard: Improving {LLM} Guardrails in {CS} Education",
    author = "Raihan, Nishat  and
      Erdachew, Noah  and
      Devi, Jayoti  and
      Santos, Joanna C. S.  and
      Zampieri, Marcos",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2026",
    year = "2026",
    publisher = "Association for Computational Linguistics",
}

Part of the CodeGuard Framework for Safe AI in CS Education
