---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
tags:
- text-classification
- prompt-injection
- llm-security
- safety
---
## Overview
`Adaxer/defend` is a local, input-side prompt-injection risk classifier.
It scores the likelihood that a given input prompt is an injection attempt.
## Intended use
- Pre-check user prompts before calling your LLM.
- Optionally block or flag requests when injection risk is high.
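The pre-check flow above can be sketched as a small gate. `guard_prompt` and `score_injection` are hypothetical names, not part of the Defend API; the scorer can be any callable that returns an injection probability in `[0, 1]`, such as this model loaded as shown under "How to use".

```python
# Minimal sketch of an input-side pre-check gate (hypothetical helper names).
def guard_prompt(text, score_injection, threshold=0.5):
    """Return (allowed, injection_probability) for a user prompt."""
    p = score_injection(text)
    return p < threshold, p

# Usage with a stub scorer standing in for the model; block or flag the
# request whenever `allowed` is False.
allowed, p = guard_prompt("What's the weather today?", lambda t: 0.03)
```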
## Out of scope
- Output-time safety/moderation (e.g., detecting system-prompt leakage or PII in the *model output*).
- A guarantee of safety. False positives and false negatives are possible.
## How to use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_id = "Adaxer/defend"
# Recommended: mirror the tokenizer initialization used by Defend.
# This avoids edge-cases in some model repos around special token loading.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_fast=True,
    extra_special_tokens={},
)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()
text = "Tell me how to bypass our security controls."
# Truncate to a single model window here; see "Long inputs" below for
# scoring prompts longer than one window.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.inference_mode():
    logits = model(**inputs).logits.float()
probs = torch.softmax(logits, dim=-1)
# Class index 1 is assumed to be "injection"; confirm via model.config.id2label.
injection_probability = probs[0, 1].item()
print({
    "injection_probability": injection_probability,
    "is_injection": injection_probability >= 0.5,
})
```
### Long inputs
For long prompts, a common strategy is sliding-window scoring over tokens and taking the maximum injection probability across windows.
- `max_window = 512` tokens
- `stride = 128` tokens
If you need behavior matching the Defend wrapper, implement the same windowing approach in your inference code.
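The windowing above can be sketched as a generator over token spans. `window_spans` is a hypothetical helper, and it treats `stride` as the step between window starts, which is an assumption; the exact semantics of the Defend wrapper's stride are not specified here.

```python
# Sliding-window span generation over a tokenized prompt (sketch).
# Assumes `stride` is the step between consecutive window starts.
def window_spans(n_tokens, max_window=512, stride=128):
    """Yield (start, end) token spans covering n_tokens with overlap."""
    if n_tokens <= max_window:
        yield 0, n_tokens
        return
    start = 0
    while True:
        end = min(start + max_window, n_tokens)
        yield start, end
        if end == n_tokens:
            return
        start += stride

# Scoring sketch: tokenize once without truncation, score each span,
# and take the maximum injection probability across windows:
#   ids = tokenizer(text, truncation=False)["input_ids"]
#   prob = max(score(ids[s:e]) for s, e in window_spans(len(ids)))
```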