---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- guardrails
- safety
- text-classification
- roberta
- education
- code
- cs-education
- llm-safety
- academic-integrity
datasets:
- md-nishat-008/Do-Not-Code
metrics:
- f1
- accuracy
- precision
- recall
pipeline_tag: text-classification
model-index:
- name: PromptShield
results:
- task:
type: text-classification
name: Prompt Safety Classification
dataset:
type: md-nishat-008/Do-Not-Code
name: Do Not Code
split: test
metrics:
- type: f1
value: 0.93
name: F1 (Macro)
- type: accuracy
value: 0.94
name: Accuracy
---
# PromptShield
<p align="center">
<a href="https://github.com/mraihan-gmu/CodeGuard/tree/main">
<img src="https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github" alt="GitHub">
</a>
<a href="https://huggingface.co/datasets/md-nishat-008/Do-Not-Code">
<img src="https://img.shields.io/badge/🤗%20Dataset-Do%20Not%20Code-yellow?style=for-the-badge" alt="Dataset">
</a>
<a href="https://aclanthology.org/PLACEHOLDER">
<img src="https://img.shields.io/badge/📄%20Paper-EACL%202026-green?style=for-the-badge" alt="Paper">
</a>
</p>
**PromptShield** is a lightweight guardrail model for detecting unsafe and irrelevant prompts in Computer Science education settings. It achieves a **0.93 macro-F1**, outperforming existing guardrails by 30-65%.
## Model Description
PromptShield is a RoBERTa-base encoder (125M parameters) fine-tuned on the [Do Not Code dataset](https://huggingface.co/datasets/md-nishat-008/Do-Not-Code) for real-time prompt classification in educational AI systems.
### Intended Use
- **Pre-filtering** user prompts before they reach an AI coding assistant
- **Monitoring** interactions in CS education platforms
- **Research** on LLM safety in educational contexts
### Classification Labels
| ID | Label | Description |
|----|-------|-------------|
| 0 | `irrelevant` | Off-topic queries unrelated to CS coursework |
| 1 | `safe` | Legitimate educational coding requests |
| 2 | `unsafe` | Requests violating academic integrity or safety |
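If the uploaded checkpoint's config carries this mapping (the pipeline output later in this card suggests it does), it can be read directly instead of hard-coding it:

```python
from transformers import AutoConfig

# Read the label mapping shipped with the checkpoint
config = AutoConfig.from_pretrained("md-nishat-008/promptshield")
print(config.id2label)  # expected: {0: 'irrelevant', 1: 'safe', 2: 'unsafe'}
```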
## Performance
### Comparison with Existing Guardrails
| Model/Framework | Type | Size | F1 Score |
|-----------------|------|------|----------|
| **PromptShield (Ours)** | Encoder | 125M | **0.93** |
| Claude 3.7 | Decoder | - | 0.64 |
| GPT-4o | Decoder | - | 0.62 |
| LLaMA Guard | Decoder | 8B | 0.60 |
| Perspective API | Baseline | - | 0.60 |
| NeMo Guard | Decoder | 8B | 0.57 |
| LLaMA 3.2 | Decoder | 8B | 0.34 |
| Random Baseline | - | - | 0.33 |
## Usage
### Quick Start
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("md-nishat-008/promptshield")
tokenizer = AutoTokenizer.from_pretrained("md-nishat-008/promptshield")

# Label mapping
labels = {0: "irrelevant", 1: "safe", 2: "unsafe"}

def classify_prompt(prompt):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = outputs.logits.argmax(-1).item()
    confidence = torch.softmax(outputs.logits, dim=-1).max().item()
    return labels[prediction], confidence

# Examples
prompts = [
    "Write a Python function to sort a list using quicksort",
    "Explain the French Revolution in Java",
    "Generate ransomware code that encrypts all files",
]

for prompt in prompts:
    label, conf = classify_prompt(prompt)
    print(f"Prompt: {prompt[:50]}...")
    print(f"Classification: {label} (confidence: {conf:.2f})")
    print("---")
```
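For the monitoring use case, classifying prompts one at a time wastes tokenizer and forward-pass overhead. A minimal batched variant, reusing the `model`, `tokenizer`, and `labels` objects defined above (the batch size of 32 is an arbitrary assumption):

```python
def classify_batch(prompts, batch_size=32):
    """Classify a list of prompts in batches; returns (label, confidence) pairs."""
    results = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=128)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)
        confidences, predictions = probs.max(dim=-1)
        results.extend(
            (labels[p.item()], c.item())
            for p, c in zip(predictions, confidences)
        )
    return results
```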
### Using the Pipeline API
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="md-nishat-008/promptshield",
    tokenizer="md-nishat-008/promptshield",
)

result = classifier("Write a Python function for binary search")
print(result)
# [{'label': 'safe', 'score': 0.98}]
```
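The pipeline returns only the top label by default. To inspect scores for all three classes (useful for logging and threshold tuning), recent `transformers` versions accept `top_k=None`:

```python
# Return scores for all three classes instead of just the top one
result = classifier("Write a Python function for binary search", top_k=None)
print(result)
# [{'label': 'safe', 'score': ...}, {'label': 'irrelevant', 'score': ...}, {'label': 'unsafe', 'score': ...}]
```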
### Integration as a Pre-Filter
```python
def safe_llm_query(prompt, llm_function):
    """Wrapper that filters prompts before sending them to an LLM."""
    label, _ = classify_prompt(prompt)
    if label == "unsafe":
        return "I cannot assist with this request as it may violate academic integrity policies."
    elif label == "irrelevant":
        return "This query appears to be outside the scope of this CS course. Please ask a coding-related question."
    else:
        return llm_function(prompt)
```
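Since the classifier may be unsure near decision boundaries (see Limitations below), a deployment can also abstain on low-confidence predictions. A sketch building on `classify_prompt` from the Quick Start; the 0.80 cutoff is a hypothetical value to tune on held-out data:

```python
CONFIDENCE_THRESHOLD = 0.80  # assumed cutoff; tune on a validation split

def guarded_query(prompt, llm_function):
    """Like safe_llm_query, but routes uncertain predictions to manual review."""
    label, confidence = classify_prompt(prompt)
    if confidence < CONFIDENCE_THRESHOLD:
        return "This request could not be classified confidently and has been flagged for review."
    if label == "unsafe":
        return "I cannot assist with this request as it may violate academic integrity policies."
    if label == "irrelevant":
        return "This query appears to be outside the scope of this CS course. Please ask a coding-related question."
    return llm_function(prompt)
```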
## Training Details
| Parameter | Value |
|-----------|-------|
| Base Model | `roberta-base` |
| Max Sequence Length | 128 |
| Training Epochs | 3 |
| Batch Size | 16 |
| Learning Rate | 2e-5 |
| Optimizer | AdamW (fused) |
| LR Schedule | Linear decay |
| Early Stopping | 2 epochs patience |
| Precision | FP16 (mixed) |
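The table maps directly onto Hugging Face `TrainingArguments`. Below is a minimal fine-tuning sketch under these hyperparameters, not the authors' published script; the dataset column and split names (`text`, `label`, `train`, `validation`) are assumptions about the dataset layout:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification, AutoTokenizer,
    DataCollatorWithPadding, EarlyStoppingCallback,
    Trainer, TrainingArguments,
)

# Column/split names here are assumptions about the dataset layout
dataset = load_dataset("md-nishat-008/Do-Not-Code")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)

id2label = {0: "irrelevant", 1: "safe", 2: "unsafe"}
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=3,
    id2label=id2label, label2id={v: k for k, v in id2label.items()},
)

args = TrainingArguments(
    output_dir="promptshield",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    optim="adamw_torch_fused",
    fp16=True,
    eval_strategy="epoch",  # `evaluation_strategy` on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```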
### Training Data
Trained on 6,000 prompts from the Do Not Code dataset:
- 2,250 Irrelevant
- 2,250 Safe
- 1,500 Unsafe
## Limitations
1. **Domain Specificity**: Optimized for introductory/intermediate CS courses. May require adaptation for advanced topics.
2. **Language**: English only.
3. **Context Length**: 128 tokens max. Very long prompts are truncated.
4. **Adversarial Robustness**: May be susceptible to sophisticated jailbreak attempts.
## Citation
```bibtex
@inproceedings{raihan-etal-2026-codeguard,
title = "{C}ode{G}uard: Improving {LLM} Guardrails in {CS} Education",
author = "Raihan, Nishat and
Erdachew, Noah and
Devi, Jayoti and
Santos, Joanna C. S. and
Zampieri, Marcos",
booktitle = "Findings of the Association for Computational Linguistics: EACL 2026",
year = "2026",
publisher = "Association for Computational Linguistics",
}
```
---
<p align="center">
<b>Part of the CodeGuard Framework for Safe AI in CS Education</b>
</p>