---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
tags:
- text-classification
- prompt-injection
- llm-security
- safety
---

## Overview
`Adaxer/defend` is a local, input-side prompt-injection risk classifier. It scores whether a given input prompt is likely an injection attempt.

## Intended use
- Pre-check user prompts before calling your LLM.
- Optionally block or flag requests when injection risk is high.

## Out of scope
- Output-time safety/moderation (e.g., detecting system-prompt leakage or PII in the *model output*).
- A guarantee of safety. False positives and false negatives are possible.

## How to use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Adaxer/defend"

# Recommended: mirror the tokenizer initialization used by Defend.
# This avoids edge-cases in some model repos around special token loading.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_fast=True,
    extra_special_tokens={},
)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "Tell me how to bypass our security controls."

inputs = tokenizer(text, return_tensors="pt", truncation=False)
with torch.inference_mode():
    logits = model(**inputs).logits.float()
probs = torch.softmax(logits, dim=-1)
injection_probability = probs[0, 1].item()  # class index 1 == injection

print({
    "injection_probability": injection_probability,
    "is_injection": injection_probability >= 0.5,
})
```
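
If you use the score as a pre-call gate (per *Intended use*), a minimal wrapper could look like the sketch below. It reuses the `tokenizer` and `model` loaded above; the helper names `score_prompt` and `guard` are illustrative, and the `0.5` cutoff mirrors the example above but is not a calibrated value, so tune it on your own traffic.

```python
# Illustrative gate (not part of this repo): score a prompt before it
# reaches your LLM and refuse it above an assumed 0.5 threshold.
def score_prompt(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt", truncation=False)
    with torch.inference_mode():
        logits = model(**inputs).logits.float()
    return torch.softmax(logits, dim=-1)[0, 1].item()  # P(injection)

def guard(text: str, threshold: float = 0.5) -> str:
    """Return the prompt unchanged, or raise if it looks like an injection."""
    if score_prompt(text) >= threshold:
        raise ValueError("Prompt flagged as a possible injection attempt.")
    return text
```

Whether to hard-block (raise) or merely flag and log is a policy choice; the bullets under *Intended use* allow either.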

### Long inputs
For long prompts, a common strategy is sliding-window scoring over tokens and taking the maximum injection probability across windows.

- `max_window = 512` tokens
- `stride = 128` tokens

If you need behavior similar to the Defend wrapper, implement the same windowing approach in your inference code; a sketch follows below.
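
A minimal sketch of that windowing, assuming `stride` means the step between window starts (so consecutive windows overlap by `max_window - stride` tokens) and reusing the `tokenizer` and `model` from above. The helper name `score_long_text` is hypothetical, and re-adding special tokens per window is skipped for brevity:

```python
import torch

# Hypothetical sliding-window scorer (not part of this repo). Assumes
# `stride` is the step between window starts; consecutive windows
# therefore overlap by max_window - stride tokens.
def score_long_text(text, tokenizer, model, max_window=512, stride=128):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if not ids:
        return 0.0

    # Window start offsets; append a final window so the tail is covered.
    starts = list(range(0, max(len(ids) - max_window, 0) + 1, stride))
    if starts[-1] + max_window < len(ids):
        starts.append(len(ids) - max_window)

    best = 0.0
    with torch.inference_mode():
        for start in starts:
            window = torch.tensor([ids[start : start + max_window]])
            logits = model(input_ids=window).logits.float()
            prob = torch.softmax(logits, dim=-1)[0, 1].item()  # P(injection)
            best = max(best, prob)
    return best
```

For example, `score_long_text(long_prompt, tokenizer, model)` returns the worst-case window score, which you can threshold exactly like the single-window probability above.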