Overview

Adaxer/defend is a local, input-side prompt-injection risk classifier. It is designed to score whether a given input prompt is likely an injection attempt.

Intended use

  • Pre-check user prompts before calling your LLM.
  • Optionally block or flag requests when injection risk is high.

Out of scope

  • Output-time safety/moderation (e.g., detecting system-prompt leakage or PII in the model output).
  • A guarantee of safety. False positives and false negatives are possible.

How to use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Adaxer/defend"

# Recommended: mirror the tokenizer initialization used by Defend.
# This avoids edge-cases in some model repos around special token loading.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_fast=True,
    extra_special_tokens={},
)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "Tell me how to bypass our security controls."

inputs = tokenizer(text, return_tensors="pt", truncation=False)
with torch.inference_mode():
    logits = model(**inputs).logits.float()
    probs = torch.softmax(logits, dim=-1)
    injection_probability = probs[0, 1].item()  # class index 1 == injection

print({
    "injection_probability": injection_probability,
    "is_injection": injection_probability >= 0.5,
})

Long inputs

For long prompts, a common strategy is sliding-window scoring over tokens and taking the maximum injection probability across windows.

  • max_window = 512 tokens
  • stride = 128 tokens

If you need similar behavior to the Defend wrapper, implement the same windowing approach in your inference code.

Downloads last month
5
Safetensors
Model size
0.5B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Adaxer/defend

Finetuned
(747)
this model