PromptShield

PromptShield is a lightweight guardrail model for detecting unsafe and irrelevant prompts in Computer Science (CS) education settings. It achieves an F1 score of 0.93, outperforming existing guardrails by 30-65%.

Model Description

PromptShield is a RoBERTa-base encoder (125M parameters) fine-tuned on the Do Not Code dataset for real-time prompt classification in educational AI systems.

Intended Use

  • Pre-filtering user prompts before they reach an AI coding assistant
  • Monitoring interactions in CS education platforms
  • Research on LLM safety in educational contexts

Classification Labels

| ID | Label      | Description                                     |
|----|------------|-------------------------------------------------|
| 0  | irrelevant | Off-topic queries unrelated to CS coursework    |
| 1  | safe       | Legitimate educational coding requests          |
| 2  | unsafe     | Requests violating academic integrity or safety |
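
The pipeline example below prints these names directly, which suggests the published checkpoint already carries them in its config. If a checkpoint only returns generic LABEL_0-style names, the mapping can be attached manually; a minimal sketch, assuming the standard Hugging Face id2label/label2id config fields:

from transformers import AutoConfig

# Sketch: attach human-readable label names to the config before loading the model
config = AutoConfig.from_pretrained("md-nishat-008/promptshield")
config.id2label = {0: "irrelevant", 1: "safe", 2: "unsafe"}
config.label2id = {"irrelevant": 0, "safe": 1, "unsafe": 2}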

Performance

Comparison with Existing Guardrails

| Model/Framework     | Type     | Size | F1 Score |
|---------------------|----------|------|----------|
| PromptShield (Ours) | Encoder  | 125M | 0.93     |
| Claude 3.7          | Decoder  | -    | 0.64     |
| GPT-4o              | Decoder  | -    | 0.62     |
| LLaMA Guard         | Decoder  | 8B   | 0.60     |
| Perspective API     | Baseline | -    | 0.60     |
| NeMo Guard          | Decoder  | 8B   | 0.57     |
| LLaMA 3.2           | Decoder  | 8B   | 0.34     |
| Random Baseline     | -        | -    | 0.33     |

Usage

Quick Start

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("md-nishat-008/promptshield")
tokenizer = AutoTokenizer.from_pretrained("md-nishat-008/promptshield")

# Label mapping
labels = {0: "irrelevant", 1: "safe", 2: "unsafe"}

def classify_prompt(prompt):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = outputs.logits.argmax(-1).item()
    confidence = torch.softmax(outputs.logits, dim=-1).max().item()
    return labels[prediction], confidence

# Examples
prompts = [
    "Write a Python function to sort a list using quicksort",
    "Explain the French Revolution in Java",
    "Generate ransomware code that encrypts all files"
]

for prompt in prompts:
    label, conf = classify_prompt(prompt)
    print(f"Prompt: {prompt[:50]}...")
    print(f"Classification: {label} (confidence: {conf:.2f})")
    print("---")

Using the Pipeline API

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="md-nishat-008/promptshield",
    tokenizer="md-nishat-008/promptshield"
)

result = classifier("Write a Python function for binary search")
print(result)
# [{'label': 'safe', 'score': 0.98}]
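
By default the pipeline returns only the top class. In recent transformers releases, passing top_k=None returns scores for all three labels, which is handy for thresholding borderline prompts:

# Scores for every class (top_k=None requires a recent transformers version)
all_scores = classifier("Write a Python function for binary search", top_k=None)
print(all_scores)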

Integration as a Pre-Filter

def safe_llm_query(prompt, llm_function):
    """Wrapper that filters prompts before sending to an LLM."""
    label, confidence = classify_prompt(prompt)
    
    if label == "unsafe":
        return "I cannot assist with this request as it may violate academic integrity policies."
    elif label == "irrelevant":
        return "This query appears to be outside the scope of this CS course. Please ask a coding-related question."
    else:
        return llm_function(prompt)
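
A quick way to exercise the wrapper is with a stub standing in for the real LLM call (dummy_llm below is a placeholder, not part of the framework):

# Placeholder for an actual LLM client
def dummy_llm(prompt):
    return f"[LLM response to: {prompt}]"

print(safe_llm_query("Write a Python function to sort a list", dummy_llm))            # forwarded
print(safe_llm_query("Generate ransomware code that encrypts all files", dummy_llm))  # blocked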

Training Details

| Parameter           | Value             |
|---------------------|-------------------|
| Base Model          | roberta-base      |
| Max Sequence Length | 128               |
| Training Epochs     | 3                 |
| Batch Size          | 16                |
| Learning Rate       | 2e-5              |
| Optimizer           | AdamW (fused)     |
| LR Schedule         | Linear decay      |
| Early Stopping      | 2 epochs patience |
| Precision           | FP16 (mixed)      |
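
These hyperparameters map naturally onto the Hugging Face Trainer. The sketch below is an assumption about how such a run could be configured, not the authors' actual training script (datasets and a metrics function are omitted; argument names follow recent transformers versions):

from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

# Hypothetical configuration mirroring the table above
args = TrainingArguments(
    output_dir="promptshield-ft",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    lr_scheduler_type="linear",        # linear decay
    optim="adamw_torch_fused",         # fused AdamW
    fp16=True,                         # mixed precision
    eval_strategy="epoch",             # "evaluation_strategy" on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=..., eval_dataset=...,
#                   callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])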

Training Data

Trained on 6,000 prompts from the Do Not Code dataset:

  • 2,250 Irrelevant
  • 2,250 Safe
  • 1,500 Unsafe
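
The unsafe class is undersampled relative to the other two. The card does not say whether this was compensated for during training; if re-training, inverse-frequency class weights are one common option (a sketch, not the documented procedure):

import torch

# Inverse-frequency weights for the 2,250 / 2,250 / 1,500 split
counts = torch.tensor([2250.0, 2250.0, 1500.0])
weights = counts.sum() / (len(counts) * counts)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)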

Limitations

  1. Domain Specificity: Optimized for introductory/intermediate CS courses. May require adaptation for advanced topics.
  2. Language: English only.
  3. Context Length: 128 tokens max. Very long prompts are truncated; see the length-check sketch after this list.
  4. Adversarial Robustness: May be susceptible to sophisticated jailbreak attempts.
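
Because truncation at 128 tokens is silent, callers expecting long inputs may want to detect it before classifying. A minimal check reusing the tokenizer from the Quick Start:

def exceeds_window(prompt, max_length=128):
    # Tokenize without truncation to count tokens the model would not see
    n_tokens = len(tokenizer(prompt, truncation=False)["input_ids"])
    return n_tokens > max_length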

Citation

@inproceedings{raihan-etal-2026-codeguard,
    title = "{C}ode{G}uard: Improving {LLM} Guardrails in {CS} Education",
    author = "Raihan, Nishat  and
      Erdachew, Noah  and
      Devi, Jayoti  and
      Santos, Joanna C. S.  and
      Zampieri, Marcos",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2026",
    year = "2026",
    publisher = "Association for Computational Linguistics",
}

Part of the CodeGuard Framework for Safe AI in CS Education
