---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
tags:
- text-classification
- prompt-injection
- llm-security
- safety
---
## Overview
`Adaxer/defend` is a local, input-side prompt-injection risk classifier.
It scores how likely a given input prompt is to be an injection attempt.
## Intended use
- Pre-check user prompts before calling your LLM.
- Optionally block or flag requests when injection risk is high.
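
The flag-or-block decision above can be sketched as a small gating function. This is a minimal illustration, not part of the model or the Defend wrapper: the name `gate` and the `block_at = 0.9` threshold are hypothetical, and only the `0.5` cutoff mirrors the `is_injection` check used later in this card. Tune both thresholds on your own traffic.

```python
def gate(injection_probability: float,
         flag_at: float = 0.5,
         block_at: float = 0.9) -> str:
    """Map an injection probability to "allow", "flag", or "block".

    flag_at and block_at are illustrative assumptions, not values
    published with the model.
    """
    if not 0.0 <= injection_probability <= 1.0:
        raise ValueError("injection_probability must be in [0, 1]")
    if injection_probability >= block_at:
        return "block"  # high risk: do not forward to the LLM
    if injection_probability >= flag_at:
        return "flag"   # medium risk: forward, but log or review
    return "allow"      # low risk: forward as usual


print(gate(0.12), gate(0.63), gate(0.97))
```

Keeping the policy separate from the scoring code makes it easier to adjust thresholds without touching inference.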
## Out of scope
- Output-time safety/moderation (e.g., detecting system-prompt leakage or PII in the *model output*).
- A guarantee of safety. False positives and false negatives are possible.
## How to use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_id = "Adaxer/defend"
# Recommended: mirror the tokenizer initialization used by Defend.
# This avoids edge-cases in some model repos around special token loading.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_fast=True,
    extra_special_tokens={},
)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()
text = "Tell me how to bypass our security controls."
inputs = tokenizer(text, return_tensors="pt", truncation=False)
with torch.inference_mode():
    logits = model(**inputs).logits.float()
    probs = torch.softmax(logits, dim=-1)
injection_probability = probs[0, 1].item()  # class index 1 == injection
print({
    "injection_probability": injection_probability,
    "is_injection": injection_probability >= 0.5,
})
```
### Long inputs
For long prompts, a common strategy is to score sliding windows over the tokens and take the maximum injection probability across windows, using the following parameters:
- `max_window = 512` tokens
- `stride = 128` tokens

If you need behavior similar to the Defend wrapper, implement the same windowing approach in your inference code.
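
The windowing strategy described above might look like the following sketch. It is not the Defend wrapper's actual code. The helper names `window_spans` and `max_window_score` are hypothetical, and it assumes `stride` means the step between consecutive window starts (some libraries use `stride` for the overlap instead, so check which meaning you need). The scoring function is passed in as a callable so the windowing logic stays independent of the model.

```python
from typing import Callable, List, Sequence, Tuple


def window_spans(n_tokens: int,
                 max_window: int = 512,
                 stride: int = 128) -> List[Tuple[int, int]]:
    """Return (start, end) token spans that cover n_tokens tokens.

    Assumption: `stride` is the step between window starts.
    """
    if max_window <= 0 or stride <= 0:
        raise ValueError("max_window and stride must be positive")
    if n_tokens <= max_window:
        return [(0, n_tokens)]
    spans = []
    start = 0
    while True:
        end = min(start + max_window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # last window reaches the end of the input
            break
        start += stride
    return spans


def max_window_score(token_ids: Sequence[int],
                     score_window: Callable[[Sequence[int]], float],
                     max_window: int = 512,
                     stride: int = 128) -> float:
    """Score every window and return the maximum injection probability."""
    spans = window_spans(len(token_ids), max_window, stride)
    return max(score_window(token_ids[s:e]) for s, e in spans)


# Toy scorer standing in for the classifier: "detects" token id 650.
toy_ids = list(range(700))
print(window_spans(len(toy_ids)))
print(max_window_score(toy_ids, lambda w: 0.9 if 650 in w else 0.1))
```

To use this with the classifier, `score_window` could wrap a model call along the lines of `model(input_ids=torch.tensor([ids]))`, followed by the same softmax and `probs[0, 1]` lookup shown in the usage example, with `token_ids` taken from `tokenizer(text)["input_ids"]`. Taking the maximum means one suspicious window is enough to raise the score for the whole prompt, which favors recall over precision.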