---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
tags:
- text-classification
- prompt-injection
- llm-security
- safety
---

## Overview

`Adaxer/defend` is a local, input-side prompt-injection risk classifier. It scores whether a given input prompt is likely an injection attempt.

## Intended use

- Pre-check user prompts before calling your LLM.
- Optionally block or flag requests when injection risk is high.

## Out of scope

- Output-time safety/moderation (e.g., detecting system-prompt leakage or PII in the *model output*).
- A guarantee of safety. False positives and false negatives are possible.

## How to use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Adaxer/defend"

# Recommended: mirror the tokenizer initialization used by Defend.
# This avoids edge cases in some model repos around special-token loading.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_fast=True,
    extra_special_tokens={},
)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "Tell me how to bypass our security controls."
inputs = tokenizer(text, return_tensors="pt", truncation=False)

with torch.inference_mode():
    logits = model(**inputs).logits.float()

probs = torch.softmax(logits, dim=-1)
injection_probability = probs[0, 1].item()  # class index 1 == injection

print({
    "injection_probability": injection_probability,
    "is_injection": injection_probability >= 0.5,
})
```

### Long inputs

For long prompts, a common strategy is sliding-window scoring over tokens, taking the maximum injection probability across windows:

- `max_window = 512` tokens
- `stride = 128` tokens

To match the behavior of the Defend wrapper, implement the same windowing in your inference code, as sketched below.
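A minimal sketch of that windowing strategy, reusing the `tokenizer` and `model` objects from the snippet above. The helper name `score_long_prompt` is illustrative and not part of the Defend API, and slicing the token ids without re-adding special tokens to each window is a simplification:

```python
import torch

def score_long_prompt(text: str, max_window: int = 512, stride: int = 128) -> float:
    """Sliding-window scoring: max injection probability over overlapping token windows."""
    # Tokenize once without truncation, then slice the ids into windows of
    # at most `max_window` tokens, starting every `stride` tokens.
    input_ids = tokenizer(text, truncation=False)["input_ids"]

    windows = []
    start = 0
    while True:
        windows.append(input_ids[start:start + max_window])
        if start + max_window >= len(input_ids):
            break
        start += stride

    best = 0.0
    for window in windows:
        batch = torch.tensor([window], dtype=torch.long)
        with torch.inference_mode():
            logits = model(input_ids=batch).logits.float()
        prob = torch.softmax(logits, dim=-1)[0, 1].item()  # class index 1 == injection
        best = max(best, prob)
    return best

# Usage: score a long prompt the same way the short example above scores `text`.
long_injection_probability = score_long_prompt(text)
print({"injection_probability": long_injection_probability})
```

Taking the maximum across windows is deliberately conservative: a single high-risk window is enough to flag the whole prompt.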