---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- guardrails
- safety
- text-classification
- roberta
- education
- code
- cs-education
- llm-safety
- academic-integrity
datasets:
- md-nishat-008/Do-Not-Code
metrics:
- f1
- accuracy
- precision
- recall
pipeline_tag: text-classification
model-index:
- name: PromptShield
  results:
  - task:
      type: text-classification
      name: Prompt Safety Classification
    dataset:
      type: md-nishat-008/Do-Not-Code
      name: Do Not Code
      split: test
    metrics:
    - type: f1
      value: 0.93
      name: F1 (Macro)
    - type: accuracy
      value: 0.94
      name: Accuracy
---

# PromptShield

GitHub · [Dataset](https://huggingface.co/datasets/md-nishat-008/Do-Not-Code) · Paper

**PromptShield** is a lightweight guardrail model for detecting unsafe and irrelevant prompts in Computer Science education settings. It achieves a **0.93 macro-F1 score**, outperforming existing guardrails by 30-65%.

## Model Description

PromptShield is a RoBERTa-base encoder (125M parameters) fine-tuned on the [Do Not Code dataset](https://huggingface.co/datasets/md-nishat-008/Do-Not-Code) for real-time prompt classification in educational AI systems.

### Intended Use

- **Pre-filtering** user prompts before they reach an AI coding assistant
- **Monitoring** interactions in CS education platforms
- **Research** on LLM safety in educational contexts

### Classification Labels

| ID | Label | Description |
|----|-------|-------------|
| 0 | `irrelevant` | Off-topic queries unrelated to CS coursework |
| 1 | `safe` | Legitimate educational coding requests |
| 2 | `unsafe` | Requests violating academic integrity or safety |

## Performance

### Comparison with Existing Guardrails

| Model/Framework | Type | Size | F1 (Macro) |
|-----------------|------|------|------------|
| **PromptShield (Ours)** | Encoder | 125M | **0.93** |
| Claude 3.7 | Decoder | - | 0.64 |
| GPT-4o | Decoder | - | 0.62 |
| LLaMA Guard | Decoder | 8B | 0.60 |
| Perspective API | Baseline | - | 0.60 |
| NeMo Guard | Decoder | 8B | 0.57 |
| LLaMA 3.2 | Decoder | 8B | 0.34 |
| Random Baseline | - | - | 0.33 |

## Usage

### Quick Start

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("md-nishat-008/promptshield")
tokenizer = AutoTokenizer.from_pretrained("md-nishat-008/promptshield")

# Label mapping
labels = {0: "irrelevant", 1: "safe", 2: "unsafe"}

def classify_prompt(prompt):
    """Classify a prompt and return its (label, confidence) pair."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = outputs.logits.argmax(-1).item()
    confidence = torch.softmax(outputs.logits, dim=-1).max().item()
    return labels[prediction], confidence

# Examples covering all three classes
prompts = [
    "Write a Python function to sort a list using quicksort",
    "Explain the French Revolution in Java",
    "Generate ransomware code that encrypts all files"
]

for prompt in prompts:
    label, conf = classify_prompt(prompt)
    print(f"Prompt: {prompt[:50]}...")
    print(f"Classification: {label} (confidence: {conf:.2f})")
    print("---")
```

### Using the Pipeline API

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="md-nishat-008/promptshield",
    tokenizer="md-nishat-008/promptshield"
)

result = classifier("Write a Python function for binary search")
print(result)
# [{'label': 'safe', 'score': 0.98}]
```

### Integration as a Pre-Filter

```python
def safe_llm_query(prompt, llm_function):
    """Wrapper that filters prompts before sending to an LLM."""
    label, confidence = classify_prompt(prompt)
    if label == "unsafe":
        return "I cannot assist with this request as it may violate academic integrity policies."
    elif label == "irrelevant":
        return "This query appears to be outside the scope of this CS course. Please ask a coding-related question."
    else:
        return llm_function(prompt)
```
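### Handling Low-Confidence Predictions

Because `classify_prompt` also returns a confidence score, deployments can abstain when the model is unsure. The sketch below is illustrative and not part of the released framework: the `CONFIDENCE_THRESHOLD` value and the review-fallback message are hypothetical choices you should tune for your own course setup.

```python
# Illustrative sketch only: a confidence-thresholded variant of the
# pre-filter above. CONFIDENCE_THRESHOLD (0.80) is a hypothetical value
# to tune on held-out data, not a setting from the paper.
CONFIDENCE_THRESHOLD = 0.80

def guarded_llm_query(prompt, llm_function):
    """Route low-confidence classifications to a conservative fallback."""
    label, confidence = classify_prompt(prompt)  # defined in Quick Start

    # Below the threshold, treat the prediction as unreliable and fall
    # back to a safe refusal (or queue the prompt for human review).
    if confidence < CONFIDENCE_THRESHOLD:
        return "I'm not sure this request is appropriate for this course. A moderator will review it."

    if label == "unsafe":
        return "I cannot assist with this request as it may violate academic integrity policies."
    if label == "irrelevant":
        return "This query appears to be outside the scope of this CS course. Please ask a coding-related question."
    return llm_function(prompt)
```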
## Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | `roberta-base` |
| Max Sequence Length | 128 |
| Training Epochs | 3 |
| Batch Size | 16 |
| Learning Rate | 2e-5 |
| Optimizer | AdamW (fused) |
| LR Schedule | Linear decay |
| Early Stopping | 2 epochs patience |
| Precision | FP16 (mixed) |

### Training Data

Trained on 6,000 prompts from the Do Not Code dataset:

- 2,250 `irrelevant`
- 2,250 `safe`
- 1,500 `unsafe`
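### Fine-Tuning Sketch

For reference, here is a minimal `Trainer` setup consistent with the hyperparameters listed above. It is an approximation, not the original training script: it assumes the dataset exposes `text` and `label` columns and a `validation` split, and uses `promptshield-finetune` as a placeholder output directory.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification, AutoTokenizer,
    EarlyStoppingCallback, Trainer, TrainingArguments,
)

# Assumption: "text"/"label" columns and a "validation" split; adjust
# to the actual schema of md-nishat-008/Do-Not-Code.
dataset = load_dataset("md-nishat-008/Do-Not-Code")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=3,
    id2label={0: "irrelevant", 1: "safe", 2: "unsafe"},
)

args = TrainingArguments(
    output_dir="promptshield-finetune",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    lr_scheduler_type="linear",       # linear decay
    optim="adamw_torch_fused",        # fused AdamW
    fp16=True,                        # mixed precision
    eval_strategy="epoch",            # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    processing_class=tokenizer,       # "tokenizer=" on older transformers
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```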

## Limitations

1. **Domain Specificity**: Optimized for introductory/intermediate CS courses. May require adaptation for advanced topics.
2. **Language**: English only.
3. **Context Length**: 128 tokens max. Very long prompts are truncated.
4. **Adversarial Robustness**: May be susceptible to sophisticated jailbreak attempts.

## Citation

```bibtex
@inproceedings{raihan-etal-2026-codeguard,
    title = "{C}ode{G}uard: Improving {LLM} Guardrails in {CS} Education",
    author = "Raihan, Nishat and Erdachew, Noah and Devi, Jayoti and Santos, Joanna C. S. and Zampieri, Marcos",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2026",
    year = "2026",
    publisher = "Association for Computational Linguistics",
}
```

---

Part of the CodeGuard Framework for Safe AI in CS Education