distilbert-insecure-output

Fine-tuned DistilBERT classifier that detects dangerous payloads in LLM-generated output.

Covers OWASP LLM Top 10, LLM02: Insecure Output Handling.

What it detects

Malicious code or injection payloads that an LLM might generate, including:

  • Cross-site scripting (XSS): <script>alert(document.cookie)</script>
  • SQL injection: '; DROP TABLE users; --
  • Command injection: | cat /etc/passwd
  • Path traversal: ../../etc/shadow
  • UNION-based SQL attacks

Labels

Label      ID  Meaning
SAFE       0   Safe output (normal text, parameterized queries, sanitized code)
MALICIOUS  1   Dangerous payload detected

Usage

from transformers import pipeline

clf = pipeline("text-classification", model="Builder117/distilbert-insecure-output")

clf("<script>alert(document.cookie)</script>")
# [{'label': 'MALICIOUS', 'score': 0.98}]

clf("SELECT * FROM products WHERE id = ?")
# [{'label': 'SAFE', 'score': 0.97}]  # parameterized, so safe
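
To use the classifier as a gate on LLM output, the pipeline result can be wrapped in a small helper. The following is a minimal sketch, not part of the model's own API: the function names `is_blocked` and `guard_output`, the 0.5 threshold, and the placeholder message are all illustrative assumptions. Only the `SAFE` / `MALICIOUS` label names and the result shape come from the usage shown above.

```python
from typing import Callable, Dict, List

# Label name taken from the model card; everything else here is hypothetical.
MALICIOUS_LABEL = "MALICIOUS"

# Any callable with the pipeline's single-string behaviour:
# text -> [{"label": ..., "score": ...}]
Classifier = Callable[[str], List[Dict[str, object]]]


def is_blocked(text: str, clf: Classifier, threshold: float = 0.5) -> bool:
    """Return True when `clf` labels `text` MALICIOUS with score >= threshold."""
    result = clf(text)[0]  # one dict per input string
    return result["label"] == MALICIOUS_LABEL and float(result["score"]) >= threshold


def guard_output(text: str, clf: Classifier, threshold: float = 0.5) -> str:
    """Pass unflagged text through unchanged; replace flagged text with a placeholder."""
    if is_blocked(text, clf, threshold):
        return "[output withheld: possible injection payload]"
    return text
```

With the pipeline from the usage example, this would be called as guard_output(llm_text, clf). The threshold trades false positives against missed payloads and should be tuned on your own data rather than taken from this sketch.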

Training

  • Base model: distilbert-base-uncased
  • Positive class: XSS payloads, SQL injection strings, command injection, path traversal
  • Negative class: parameterized queries, sanitized code, normal text, safe SQL

Limitations

  • Encoded payloads (base64, HTML entities, hex encoding) may evade detection
  • Context-blind: the model sees only the output text, so it cannot tell whether a SQL string will be executed with bound parameters or assembled by raw string concatenation
  • May produce false positives on security documentation that quotes attack strings
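
One way to reduce the encoded-payload gap is to classify a few decoded variants of the text alongside the original. The sketch below is an assumption about how a caller might do this, not something the model does itself: `decoded_variants` and `any_variant_malicious` are hypothetical names, percent-decoding is an extra that the list above does not mention, and nested or partial encodings are not handled.

```python
import base64
import html
from typing import Callable, Dict, List, Optional
from urllib.parse import unquote


def _try_bytes_decoder(decoder: Callable[[str], bytes], text: str) -> Optional[str]:
    """Run a bytes-producing decoder; return UTF-8 text, or None on failure or empty result."""
    try:
        return decoder(text).decode("utf-8") or None
    except ValueError:  # also covers binascii.Error and UnicodeDecodeError
        return None


def decoded_variants(text: str) -> List[str]:
    """Return `text` plus best-effort decodings, in order, without duplicates."""
    stripped = text.strip()
    candidates = [
        text,
        html.unescape(text),  # HTML entities such as &lt;
        unquote(text),        # percent-encoding (assumption: not listed in the card)
        _try_bytes_decoder(lambda s: base64.b64decode(s, validate=True), stripped),
        _try_bytes_decoder(bytes.fromhex, stripped),
    ]
    variants: List[str] = []
    for candidate in candidates:
        if candidate is not None and candidate not in variants:
            variants.append(candidate)
    return variants


def any_variant_malicious(text: str, clf: Callable[[str], List[Dict[str, object]]]) -> bool:
    """Return True if the classifier labels any decoded variant MALICIOUS."""
    return any(clf(v)[0]["label"] == "MALICIOUS" for v in decoded_variants(text))
```

Classifying extra variants multiplies inference cost and can add false positives when ordinary text happens to decode into something suspicious, so this is best treated as an optional pre-processing step.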

Part of

LLM Threat Shield: OWASP LLM Top 10 detection suite.

Model size: 67M params (Safetensors, tensor type F32)