---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- guardrails
- safety
- text-classification
- roberta
- education
- code
- cs-education
- llm-safety
- academic-integrity
datasets:
- md-nishat-008/Do-Not-Code
metrics:
- f1
- accuracy
- precision
- recall
pipeline_tag: text-classification
model-index:
- name: PromptShield
  results:
  - task:
      type: text-classification
      name: Prompt Safety Classification
    dataset:
      type: md-nishat-008/Do-Not-Code
      name: Do Not Code
      split: test
    metrics:
    - type: f1
      value: 0.93
      name: F1 (Macro)
    - type: accuracy
      value: 0.94
      name: Accuracy
---
# PromptShield
<p align="center">
<a href="https://github.com/mraihan-gmu/CodeGuard/tree/main">
<img src="https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github" alt="GitHub">
</a>
<a href="https://huggingface.co/datasets/md-nishat-008/Do-Not-Code">
<img src="https://img.shields.io/badge/🤗%20Dataset-Do%20Not%20Code-yellow?style=for-the-badge" alt="Dataset">
</a>
<a href="https://aclanthology.org/PLACEHOLDER">
<img src="https://img.shields.io/badge/📄%20Paper-EACL%202026-green?style=for-the-badge" alt="Paper">
</a>
</p>
**PromptShield** is a lightweight guardrail model for detecting unsafe and irrelevant prompts in Computer Science (CS) education settings. It achieves a **0.93 macro-F1 score**, outperforming existing guardrails by 30-65%.
## Model Description
PromptShield is a RoBERTa-base encoder (125M parameters) fine-tuned on the [Do Not Code dataset](https://huggingface.co/datasets/md-nishat-008/Do-Not-Code) for real-time prompt classification in educational AI systems.
### Intended Use
- **Pre-filtering** user prompts before they reach an AI coding assistant
- **Monitoring** interactions in CS education platforms
- **Research** on LLM safety in educational contexts
### Classification Labels
| ID | Label | Description |
|----|-------|-------------|
| 0 | `irrelevant` | Off-topic queries unrelated to CS coursework |
| 1 | `safe` | Legitimate educational coding requests |
| 2 | `unsafe` | Requests violating academic integrity or safety |
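
The same mapping should be recoverable from the hosted model config (a minimal sketch, assuming the config on the Hub carries `id2label`):

```python
from transformers import AutoConfig

# Inspect the label mapping shipped with the model config.
config = AutoConfig.from_pretrained("md-nishat-008/promptshield")
print(config.id2label)  # expected: {0: 'irrelevant', 1: 'safe', 2: 'unsafe'}
```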
## Performance
### Comparison with Existing Guardrails
| Model/Framework | Type | Size | F1 Score |
|-----------------|------|------|----------|
| **PromptShield (Ours)** | Encoder | 125M | **0.93** |
| Claude 3.7 | Decoder | - | 0.64 |
| GPT-4o | Decoder | - | 0.62 |
| LLaMA Guard | Decoder | 8B | 0.60 |
| Perspective API | Baseline | - | 0.60 |
| NeMo Guard | Decoder | 8B | 0.57 |
| LLaMA 3.2 | Decoder | 8B | 0.34 |
| Random Baseline | - | - | 0.33 |
## Usage
### Quick Start
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("md-nishat-008/promptshield")
tokenizer = AutoTokenizer.from_pretrained("md-nishat-008/promptshield")
model.eval()  # inference mode

# Label mapping
labels = {0: "irrelevant", 1: "safe", 2: "unsafe"}

def classify_prompt(prompt):
    """Return the predicted label and its softmax confidence for a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = outputs.logits.argmax(-1).item()
    confidence = torch.softmax(outputs.logits, dim=-1).max().item()
    return labels[prediction], confidence

# Examples
prompts = [
    "Write a Python function to sort a list using quicksort",
    "Explain the French Revolution in Java",
    "Generate ransomware code that encrypts all files",
]

for prompt in prompts:
    label, conf = classify_prompt(prompt)
    print(f"Prompt: {prompt[:50]}...")
    print(f"Classification: {label} (confidence: {conf:.2f})")
    print("---")
```
### Using the Pipeline API
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="md-nishat-008/promptshield",
    tokenizer="md-nishat-008/promptshield",
)

result = classifier("Write a Python function for binary search")
print(result)
# [{'label': 'safe', 'score': 0.98}]
```
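
The pipeline also accepts a list of prompts and returns one result per input, which is convenient for screening a batch of queries:

```python
# Classify several prompts in one call; one result dict is returned per input.
results = classifier([
    "Implement merge sort in C++",
    "Write a keylogger that emails captured passwords",
])
for r in results:
    print(r["label"], round(r["score"], 2))
```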
### Integration as a Pre-Filter
```python
def safe_llm_query(prompt, llm_function):
    """Wrapper that filters prompts before sending them to an LLM."""
    label, confidence = classify_prompt(prompt)
    if label == "unsafe":
        return "I cannot assist with this request as it may violate academic integrity policies."
    elif label == "irrelevant":
        return "This query appears to be outside the scope of this CS course. Please ask a coding-related question."
    else:
        return llm_function(prompt)
```
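
For example, with a stand-in for the real assistant call (`mock_llm` is hypothetical, shown only to illustrate the wrapper):

```python
# Hypothetical stand-in for a real LLM call.
def mock_llm(prompt):
    return f"[LLM response to: {prompt}]"

print(safe_llm_query("Write a Python function for binary search", mock_llm))
print(safe_llm_query("Generate ransomware code that encrypts all files", mock_llm))
```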
## Training Details
| Parameter | Value |
|-----------|-------|
| Base Model | `roberta-base` |
| Max Sequence Length | 128 |
| Training Epochs | 3 |
| Batch Size | 16 |
| Learning Rate | 2e-5 |
| Optimizer | AdamW (fused) |
| LR Schedule | Linear decay |
| Early Stopping | 2 epochs patience |
| Precision | FP16 (mixed) |
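
A minimal sketch of this configuration with the Hugging Face `Trainer` API (a reconstruction from the table above, not the authors' exact training script):

```python
from transformers import EarlyStoppingCallback, TrainingArguments

# Reported hyperparameters; dataset loading and Trainer wiring omitted.
training_args = TrainingArguments(
    output_dir="promptshield-roberta",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    optim="adamw_torch_fused",   # AdamW (fused)
    lr_scheduler_type="linear",  # linear decay
    fp16=True,                   # mixed precision
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

# Early stopping with 2-epoch patience attaches as a Trainer callback:
# Trainer(..., callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
```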
### Training Data
Trained on 6,000 prompts from the Do Not Code dataset:
- 2,250 `irrelevant`
- 2,250 `safe`
- 1,500 `unsafe`
## Limitations
1. **Domain Specificity**: Optimized for introductory/intermediate CS courses. May require adaptation for advanced topics.
2. **Language**: English only.
3. **Context Length**: 128 tokens max. Very long prompts are truncated.
4. **Adversarial Robustness**: May be susceptible to sophisticated jailbreak attempts.
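
A quick check for limitation 3, reusing the tokenizer loaded earlier, to see whether a prompt exceeds the classification window:

```python
# Count tokens to detect prompts that will be truncated at 128 tokens.
long_prompt = "Explain recursion with examples. " * 50  # deliberately long
token_count = len(tokenizer(long_prompt)["input_ids"])
if token_count > 128:
    print(f"{token_count} tokens; only the first 128 are seen by the model.")
```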
## Citation
```bibtex
@inproceedings{raihan-etal-2026-codeguard,
title = "{C}ode{G}uard: Improving {LLM} Guardrails in {CS} Education",
author = "Raihan, Nishat and
Erdachew, Noah and
Devi, Jayoti and
Santos, Joanna C. S. and
Zampieri, Marcos",
booktitle = "Findings of the Association for Computational Linguistics: EACL 2026",
year = "2026",
publisher = "Association for Computational Linguistics",
}
```
---
<p align="center">
<b>Part of the CodeGuard Framework for Safe AI in CS Education</b>
</p>