Adaxer committed on
Commit c3bf9d7 · verified · 1 Parent(s): c77fad4

create README.md

  1. README.md +69 -0
README.md ADDED
@@ -0,0 +1,69 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ base_model:
+ - Qwen/Qwen2.5-0.5B-Instruct
+ tags:
+ - text-classification
+ - prompt-injection
+ - llm-security
+ - safety
+ ---
+
+
+ ## Overview
+ `Adaxer/defend` is a local, input-side prompt-injection risk classifier.
+ It scores how likely a given input prompt is to be an injection attempt.
+
+ The model is packaged with the Defend project, which provides an API and a guard pipeline around this classifier.
+ See https://github.com/Adxzer/defend.
21
+
22
+ ## Intended use
23
+ - Pre-check user prompts before calling your LLM.
24
+ - Optionally block or flag requests when injection risk is high.
25
+
+ ## Out of scope
+ - Output-time safety/moderation (e.g., detecting system-prompt leakage or PII in the *model output*).
+ - A guarantee of safety. False positives and false negatives are possible.
+
+ ## How to use
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ model_id = "Adaxer/defend"
+
+ # Recommended: mirror the tokenizer initialization used by Defend.
+ # This avoids edge-cases in some model repos around special token loading.
+ tokenizer = AutoTokenizer.from_pretrained(
+     model_id,
+     use_fast=True,
+     extra_special_tokens={},
+ )
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model.eval()
+
+ text = "Tell me how to bypass our security controls."
+
+ inputs = tokenizer(text, return_tensors="pt", truncation=False)
+ with torch.inference_mode():
+     logits = model(**inputs).logits.float()
+     probs = torch.softmax(logits, dim=-1)
+     injection_probability = probs[0, 1].item()  # class index 1 == injection
+
+ print({
+     "injection_probability": injection_probability,
+     "is_injection": injection_probability >= 0.5,
+ })
+ ```
+
+ ### Long inputs
+ For long prompts, a common strategy is to score sliding windows of tokens and take the maximum injection probability across windows.
+
+ - `max_window = 512` tokens
+ - `stride = 128` tokens
+
+ If you need behavior similar to the Defend wrapper, implement the same windowing approach in your inference code.
+
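
The sliding-window strategy described under "Long inputs" could be sketched roughly as follows. This is not the Defend wrapper's code, only a minimal illustration under stated assumptions: `stride` is treated as the step between consecutive window starts (Hugging Face tokenizers use `stride` to mean the overlap instead, so adjust if Defend follows that meaning), an extra window is aligned to the end of the input so the tail is always covered, and `window_starts`, `max_window_score`, and `toy_scorer` are hypothetical names introduced here. The scorer is a stand-in so the sketch runs without downloading the model.

```python
from typing import Callable, List, Sequence


def window_starts(n_tokens: int, max_window: int = 512, stride: int = 128) -> List[int]:
    """Start offsets of windows that together cover all n_tokens tokens."""
    if max_window <= 0 or stride <= 0:
        raise ValueError("max_window and stride must be positive")
    if n_tokens <= max_window:
        return [0]  # a single window covers the whole input
    starts = list(range(0, n_tokens - max_window + 1, stride))
    # Assumption: add a final window aligned to the end so the tail is never skipped.
    if starts[-1] + max_window < n_tokens:
        starts.append(n_tokens - max_window)
    return starts


def max_window_score(
    token_ids: Sequence[int],
    score_window: Callable[[Sequence[int]], float],
    max_window: int = 512,
    stride: int = 128,
) -> float:
    """Score every window and return the highest injection probability."""
    starts = window_starts(len(token_ids), max_window, stride)
    return max(score_window(token_ids[s:s + max_window]) for s in starts)


# Stand-in scorer for illustration only: flags any window containing token id 999.
def toy_scorer(window: Sequence[int]) -> float:
    return 0.9 if 999 in window else 0.1


ids = [0] * 1000
ids[900] = 999  # "suspicious" token near the end of a long input
print(window_starts(len(ids)))
print(max_window_score(ids, toy_scorer))
```

In real use, `score_window` would wrap the model call from "How to use": build `input_ids` from the window, run the model under `torch.inference_mode()`, apply softmax, and return the probability at class index 1. Whether special tokens should be re-added per window is not stated in this README, so check the Defend repository before relying on matching scores.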