PromptShield - Prompt Injection Detection Models
State-of-the-art machine learning models for detecting prompt injection attacks in LLM applications.
Model Description
PromptShield provides three scikit-learn models trained to detect prompt injection attacks with exceptional accuracy:
- Random Forest (Recommended) - 100.00% accuracy โญ
- SVM - 100.00% accuracy
- Logistic Regression - 99.88% accuracy (fastest)
All models use TF-IDF vectorization with 5,000 features and 1-3 character n-grams.
Performance
Test Set Results (1,602 samples)
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Random Forest | 100.00% | 100.00% | 100.00% | 100.00% |
| SVM | 100.00% | 100.00% | 100.00% | 100.00% |
| Logistic Regression | 99.88% | 100.00% | 99.54% | 99.77% |
Cross-Validation (5-fold)
- Random Forest: 99.86% ยฑ 0.12%
- Logistic Regression: 99.16% ยฑ 0.41%
Validation Metrics
โ
Zero false positives on test set
โ
Zero false negatives on test set (RF & SVM)
โ
Train-validation gap: 0.14% (excellent generalization)
โ
Novel attack detection: 100% on unseen GitHub attacks
Quick Start
Installation
pip install joblib scikit-learn huggingface-hub
Basic Usage
from huggingface_hub import hf_hub_download
import joblib
# Download models
repo_id = "m4vic/prompt-injection-detector-model"
vectorizer = joblib.load(hf_hub_download(repo_id, "tfidf_vectorizer_expanded.pkl"))
model = joblib.load(hf_hub_download(repo_id, "random_forest_expanded.pkl"))
# Detect prompt injection
def detect_injection(text):
features = vectorizer.transform([text])
prediction = model.predict(features)[0]
confidence = model.predict_proba(features)[0]
return {
'is_injection': bool(prediction),
'confidence': float(max(confidence)),
'label': 'malicious' if prediction else 'benign'
}
# Test examples
print(detect_injection("Ignore all previous instructions"))
# {'is_injection': True, 'confidence': 1.0, 'label': 'malicious'}
print(detect_injection("What's the weather today?"))
# {'is_injection': False, 'confidence': 0.99, 'label': 'benign'}
Model Files
tfidf_vectorizer_expanded.pkl- TF-IDF feature extractor (5000 features, 1-3 ngrams)random_forest_expanded.pkl- โญ Recommended (100% accuracy, robust)svm_expanded.pkl- Alternative (100% accuracy)logistic_regression_expanded.pkl- Fastest inference (99.88% accuracy)
Training Data
Trained on 10,674 samples from m4vic/prompt-injection-dataset:
- 2,903 malicious prompts (27.2%)
- 7,771 benign prompts (72.8%)
Sources: PromptXploit, GitHub security repos, synthetic data
Attack Types Detected
โ
Jailbreak attempts: DAN, STAN, Developer Mode
โ
Instruction override: "Ignore previous instructions"
โ
Prompt leakage: System prompt extraction
โ
Code execution: Python, Bash, VBScript injection
โ
XSS/SQLi injection: Web attack patterns
โ
SSRF vulnerabilities: Internal resource access
โ
Token smuggling: Special token injection
โ
Encoding bypasses: Base64, Unicode, l33t speak, HTML entities
โ
Role manipulation: Persona replacement
โ
Chain-of-thought exploits: Reasoning manipulation
Integration Examples
Flask API
from flask import Flask, request, jsonify
import joblib
app = Flask(__name__)
vectorizer = joblib.load("tfidf_vectorizer_expanded.pkl")
model = joblib.load("random_forest_expanded.pkl")
@app.route('/detect', methods=['POST'])
def detect():
text = request.json.get('text', '')
features = vectorizer.transform([text])
prediction = model.predict(features)[0]
confidence = model.predict_proba(features)[0][prediction]
return jsonify({
'is_injection': bool(prediction),
'confidence': float(confidence)
})
if __name__ == '__main__':
app.run(port=5000)
LangChain Integration
from langchain.callbacks.base import BaseCallbackHandler
import joblib
class PromptInjectionFilter(BaseCallbackHandler):
def __init__(self):
self.vectorizer = joblib.load("tfidf_vectorizer_expanded.pkl")
self.model = joblib.load("random_forest_expanded.pkl")
def on_llm_start(self, serialized, prompts, **kwargs):
for prompt in prompts:
if self.is_injection(prompt):
raise ValueError("โ ๏ธ Prompt injection detected!")
def is_injection(self, text):
features = self.vectorizer.transform([text])
return bool(self.model.predict(features)[0])
# Use in LangChain
from langchain.llms import OpenAI
llm = OpenAI(callbacks=[PromptInjectionFilter()])
OpenAI API Wrapper
from openai import OpenAI
import joblib
class SecureOpenAI:
def __init__(self, api_key):
self.client = OpenAI(api_key=api_key)
self.vectorizer = joblib.load("tfidf_vectorizer_expanded.pkl")
self.model = joblib.load("random_forest_expanded.pkl")
def safe_completion(self, prompt, **kwargs):
# Check for injection
features = self.vectorizer.transform([prompt])
if self.model.predict(features)[0]:
raise ValueError("โ ๏ธ Prompt injection detected!")
# Safe to proceed
return self.client.chat.completions.create(
messages=[{"role": "user", "content": prompt}],
**kwargs
)
# Usage
client = SecureOpenAI(api_key="your-key")
response = client.safe_completion("What's the weather?")
Limitations
- Primarily tested on English language prompts
- May require domain-specific fine-tuning for specialized applications
- Performance may vary on highly obfuscated or novel attack patterns
- Designed for text-only prompts (no multimodal support)
- Attack techniques evolve; periodic retraining recommended
Ethical Considerations
This model is intended for defensive security purposes only. Use it to:
- โ Protect LLM applications from attacks
- โ Monitor and log suspicious prompts
- โ Research prompt injection techniques
Do NOT use to:
- โ Develop new attack methods
- โ Bypass security measures
- โ Enable malicious activities
Citation
@misc{m4vic2026promptshield,
author = {m4vic},
title = {PromptShield: Prompt Injection Detection Models},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/m4vic/prompt-injection-detector-model}}
}
License
Apache 2.0 - Free for commercial use
Links
- ๐ฆ Dataset: m4vic/prompt-injection-dataset
- ๐ GitHub: https://github.com/m4vic/SecurePrompt
- ๐ Documentation: Coming soon
- ๐ฎ Demo: Coming soon
Acknowledgments
Built with data from:
- PromptXploit
- TakSec/Prompt-Injection-Everywhere
- swisskyrepo/PayloadsAllTheThings
- DAN Jailbreak Community
- LLM Hacking Database