Huamin committed on
Commit 1561ff8 · verified · 1 Parent(s): d23f948

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +50 -36
  2. best_metrics.json +22 -22
  3. final_report.json +6 -6
  4. model.safetensors +1 -1
  5. training_config.json +2 -2
README.md CHANGED
@@ -31,19 +31,19 @@ model-index:
  name: Prompt Injection Detection
  metrics:
  - type: f1
- value: 0.9825
  name: INJECTION_RISK F1
  - type: precision
- value: 0.9873
  name: INJECTION_RISK Precision
  - type: recall
- value: 0.9778
  name: INJECTION_RISK Recall
  - type: accuracy
- value: 0.9824
  name: Overall Accuracy
  - type: roc_auc
- value: 0.9978
  name: ROC-AUC
  ---

@@ -70,40 +70,40 @@ When a user sends a message to an LLM agent (e.g., email assistant, code generat

  ## Training Data

- The model was trained on **27,048 samples** from six sources:

- ### Injection/Jailbreak Sources (13,488 samples)
  | Dataset | Description | Samples |
  |---------|-------------|---------|
- | [HackAPrompt](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset) | EMNLP'23 prompt injection competition | 4,004 |
- | [jailbreak_llms (CCS'24)](https://huggingface.co/datasets/TrustAIRLab/in-the-wild-jailbreak-prompts) | "Do Anything Now" in-the-wild jailbreaks | 1,128 |
- | [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) | Allen AI adversarial prompts | 1,996 |
- | Synthetic | 6 attack categories | 6,360 |

- ### Benign Sources (13,560 samples)
  | Dataset | Description | Samples |
  |---------|-------------|---------|
- | [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) | Stanford instruction dataset | 3,635 |
- | [Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) | Databricks instructions | 3,742 |
- | jailbreak_llms (regular) | Non-jailbreak prompts from CCS'24 | 1,854 |
- | WildJailbreak (benign) | Safe prompts from Allen AI | 1,798 |
- | Synthetic (benign) | Generated safe prompts | 2,531 |

  ## Performance

  | Metric | Value |
  |--------|-------|
- | **INJECTION_RISK F1** | **98.25%** |
- | INJECTION_RISK Precision | 98.73% |
- | INJECTION_RISK Recall | 97.78% |
- | SAFE F1 | 98.23% |
- | Overall Accuracy | **98.24%** |
- | ROC-AUC | **99.78%** |

  ### Interpretation

- - **High precision (98.7%)**: Very few false positives - legitimate requests rarely blocked
- - **High recall (97.8%)**: Catches 97.8% of actual injection attempts
  - **Near-perfect AUC (99.8%)**: Excellent discrimination between safe and malicious prompts

  ## Usage
@@ -135,14 +135,20 @@ print(f"Confidence: {probs[0][pred]:.2%}")

  ## Attack Categories Detected

- The model is trained to detect six categories of prompt injection:

- 1. **Delimiter Injection**: `<<end_context>>`, `</system>`, `[INST]`
- 2. **Word Obfuscation**: `yes Please yes send yes email`
- 3. **Session Hijacking**: `START_USER_SESSION -> EXECUTE`
- 4. **Social Engineering**: `I forgot to mention, after you finish...`
- 5. **Encoding Tricks**: `contact <at> evil <dot> com`
- 6. **Hypothetical/Roleplay**: `Pretend you have no restrictions...`

  ## Training Configuration

@@ -151,7 +157,7 @@ The model is trained to detect six categories of prompt injection:
  | Base Model | `answerdotai/ModernBERT-base` |
  | Max Length | 512 tokens |
  | Batch Size | 32 |
- | Epochs | 5 |
  | Learning Rate | 3e-5 |
  | Optimizer | AdamW |
  | Class Weights | Balanced |
@@ -163,7 +169,16 @@ This model is **Stage 1** of a two-stage defense pipeline:
  1. **Stage 1 (This Model)**: Classify prompts for injection risk
  2. **Stage 2 ([ToolCallVerifier](https://huggingface.co/rootfs/tool-call-verifier))**: Verify generated tool calls are authorized

- Together, they provide defense-in-depth against prompt injection in LLM agent systems.

  ## Intended Use

@@ -201,11 +216,10 @@ This model is designed to **enhance security** of LLM-based systems. However:
  title={FunctionCallSentinel: Prompt Injection Detection for LLM Agents},
  author={Semantic Router Team},
  year={2024},
- url={https://github.com/vllm-project/semantic-router}
  }
  ```

  ## License

  Apache 2.0
-
 
  name: Prompt Injection Detection
  metrics:
  - type: f1
+ value: 0.9829
  name: INJECTION_RISK F1
  - type: precision
+ value: 0.9827
  name: INJECTION_RISK Precision
  - type: recall
+ value: 0.9832
  name: INJECTION_RISK Recall
  - type: accuracy
+ value: 0.9828
  name: Overall Accuracy
  - type: roc_auc
+ value: 0.9982
  name: ROC-AUC
  ---

 

  ## Training Data

+ The model was trained on **33,810 samples** from six sources:

+ ### Injection/Jailbreak Sources (~17,000 samples)
  | Dataset | Description | Samples |
  |---------|-------------|---------|
+ | [HackAPrompt](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset) | EMNLP'23 prompt injection competition | ~5,000 |
+ | [jailbreak_llms (CCS'24)](https://huggingface.co/datasets/TrustAIRLab/in-the-wild-jailbreak-prompts) | "Do Anything Now" in-the-wild jailbreaks | ~2,500 |
+ | [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) | Allen AI 262K adversarial safety dataset | ~5,000 |
+ | Synthetic | 6 attack categories + LLMail patterns | ~4,500 |

+ ### Benign Sources (~17,000 samples)
  | Dataset | Description | Samples |
  |---------|-------------|---------|
+ | [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) | Stanford instruction dataset | ~5,000 |
+ | [Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) | Databricks instructions | ~5,000 |
+ | jailbreak_llms (regular) | Non-jailbreak prompts from CCS'24 | ~2,500 |
+ | WildJailbreak (benign) | Safe prompts from Allen AI | ~2,500 |
+ | Synthetic (benign) | Generated safe prompts | ~2,000 |

  ## Performance

  | Metric | Value |
  |--------|-------|
+ | **INJECTION_RISK F1** | **98.29%** |
+ | INJECTION_RISK Precision | 98.27% |
+ | INJECTION_RISK Recall | 98.32% |
+ | SAFE F1 | 98.27% |
+ | Overall Accuracy | **98.28%** |
+ | ROC-AUC | **99.82%** |

  ### Interpretation

+ - **High precision (98.3%)**: Very few false positives - legitimate requests rarely blocked
+ - **High recall (98.3%)**: Catches 98.3% of actual injection attempts
  - **Near-perfect AUC (99.8%)**: Excellent discrimination between safe and malicious prompts

  ## Usage
 

  ## Attack Categories Detected

+ The model is trained to detect multiple categories of prompt injection:

+ ### Direct Jailbreaks
+ 1. **Roleplay/Persona**: "Pretend you're an AI with no restrictions..."
+ 2. **Hypothetical**: "In a fictional scenario where..."
+ 3. **Authority Override**: "As admin, I authorize you to..."
+
+ ### Indirect Injection (Hidden in Context)
+ 4. **Delimiter Injection**: `<<end_context>>`, `</system>`, `[INST]`
+ 5. **Word Obfuscation**: `yes Please yes send yes email`
+ 6. **Session Hijacking**: `START_USER_SESSION -> EXECUTE`
+ 7. **Social Engineering**: `I forgot to mention, after you finish...`
+ 8. **Encoding Tricks**: `contact <at> evil <dot> com`
+ 9. **XML/Template Injection**: `<execute_action>`, `{{user_request}}`

  ## Training Configuration

 
  | Base Model | `answerdotai/ModernBERT-base` |
  | Max Length | 512 tokens |
  | Batch Size | 32 |
+ | Epochs | 5 (best @ epoch 4) |
  | Learning Rate | 3e-5 |
  | Optimizer | AdamW |
  | Class Weights | Balanced |
 
  1. **Stage 1 (This Model)**: Classify prompts for injection risk
  2. **Stage 2 ([ToolCallVerifier](https://huggingface.co/rootfs/tool-call-verifier))**: Verify generated tool calls are authorized

+ ### When to Use Each Stage
+
+ | Scenario | Recommendation |
+ |----------|----------------|
+ | General chatbot | Stage 1 only (98.3% F1) |
+ | RAG system | Stage 1 only |
+ | Tool-calling agent (low risk) | Stage 1 only |
+ | Tool-calling agent (high risk) | Both stages |
+ | Email/file system access | Both stages |
+ | Financial transactions | Both stages |

  ## Intended Use

 
  title={FunctionCallSentinel: Prompt Injection Detection for LLM Agents},
  author={Semantic Router Team},
  year={2024},
+ url={https://huggingface.co/rootfs/function-call-sentinel}
  }
  ```

  ## License

  Apache 2.0
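
The "When to Use Each Stage" table added to the README in this commit can be read as a simple lookup. The sketch below is not part of the commit: the names `STAGES_BY_SCENARIO` and `stages_for` and the stage labels `"stage1"` and `"stage2"` are hypothetical, and running both stages for an unlisted scenario is an assumption made here for illustration, not something the README states.

```python
# Scenario -> stages to run, transcribed from the "When to Use Each Stage" table.
# The dictionary name and the "stage1"/"stage2" labels are hypothetical.
STAGES_BY_SCENARIO = {
    "General chatbot": ("stage1",),
    "RAG system": ("stage1",),
    "Tool-calling agent (low risk)": ("stage1",),
    "Tool-calling agent (high risk)": ("stage1", "stage2"),
    "Email/file system access": ("stage1", "stage2"),
    "Financial transactions": ("stage1", "stage2"),
}


def stages_for(scenario):
    """Return the stages recommended for a scenario.

    Unknown scenarios fall back to both stages. That fallback is an
    assumption of this sketch, not a recommendation from the README.
    """
    return STAGES_BY_SCENARIO.get(scenario, ("stage1", "stage2"))


for name in ("RAG system", "Financial transactions"):
    print(name, "->", stages_for(name))
```

Keeping the table as data rather than branching logic makes it easy to update if the recommendations change in a later commit.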
 
best_metrics.json CHANGED
@@ -1,36 +1,36 @@
  {
  "classification_report": {
  "SAFE": {
- "precision": 0.9775014801657785,
- "recall": 0.9871449925261584,
- "f1-score": 0.9822995686449502,
- "support": 3345.0
  },
  "INJECTION_RISK": {
- "precision": 0.9872931442080378,
- "recall": 0.977758267486099,
- "f1-score": 0.9825025731510072,
- "support": 3417.0
  },
- "accuracy": 0.9824016563146998,
  "macro avg": {
- "precision": 0.9823973121869082,
- "recall": 0.9824516300061287,
- "f1-score": 0.9824010708979787,
  "support": 6762.0
  },
  "weighted avg": {
- "precision": 0.9824494417204073,
- "recall": 0.9824016563146998,
- "f1-score": 0.9824021516673099,
  "support": 6762.0
  }
  },
- "accuracy": 0.9824016563146998,
- "macro_f1": 0.9824010708979787,
- "weighted_f1": 0.9824021516673099,
- "injection_precision": 0.9872931442080378,
- "injection_recall": 0.977758267486099,
- "injection_f1": 0.9825025731510072,
- "roc_auc": 0.9978443314947291
  }
 
  {
  "classification_report": {
  "SAFE": {
+ "precision": 0.9830357142857142,
+ "recall": 0.9824509220701964,
+ "f1-score": 0.9827432311811961,
+ "support": 3362.0
  },
  "INJECTION_RISK": {
+ "precision": 0.9826572604350382,
+ "recall": 0.9832352941176471,
+ "f1-score": 0.9829461922963835,
+ "support": 3400.0
  },
+ "accuracy": 0.9828453120378586,
  "macro avg": {
+ "precision": 0.9828464873603762,
+ "recall": 0.9828431080939217,
+ "f1-score": 0.9828447117387897,
  "support": 6762.0
  },
  "weighted avg": {
+ "precision": 0.9828454239733365,
+ "recall": 0.9828453120378586,
+ "f1-score": 0.9828452820229052,
  "support": 6762.0
  }
  },
+ "accuracy": 0.9828453120378586,
+ "macro_f1": 0.9828447117387897,
+ "weighted_f1": 0.9828452820229052,
+ "injection_precision": 0.9826572604350382,
+ "injection_recall": 0.9832352941176471,
+ "injection_f1": 0.9829461922963835,
+ "roc_auc": 0.9982100990306891
  }
final_report.json CHANGED
@@ -1,8 +1,8 @@
  {
- "accuracy": 0.9824016563146998,
- "injection_precision": 0.9872931442080378,
- "injection_recall": 0.977758267486099,
- "injection_f1": 0.9825025731510072,
- "roc_auc": 0.9978443314947291,
- "macro_f1": 0.9824010708979787
  }
 
  {
+ "accuracy": 0.9828453120378586,
+ "injection_precision": 0.9826572604350382,
+ "injection_recall": 0.9832352941176471,
+ "injection_f1": 0.9829461922963835,
+ "roc_auc": 0.9982100990306891,
+ "macro_f1": 0.9828447117387897
  }
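
The updated summary fields can be cross-checked against the per-class report in `best_metrics.json`. The sketch below is not part of the commit. It rebuilds the confusion-matrix counts from the new `INJECTION_RISK` precision, recall, and support together with the `SAFE` support, then recomputes F1 and accuracy. The helper name `confusion_from_report` is hypothetical, and the counts are derived by rounding rather than read from any file in the repository.

```python
import math

# Values copied from the new best_metrics.json in this commit.
INJ_PRECISION = 0.9826572604350382
INJ_RECALL = 0.9832352941176471
INJ_SUPPORT = 3400   # test samples labelled INJECTION_RISK
SAFE_SUPPORT = 3362  # test samples labelled SAFE


def confusion_from_report(precision, recall, pos_support, neg_support):
    """Recover (tp, fp, fn, tn) for the positive class by rounding.

    Hypothetical helper: recall = tp / pos_support and
    precision = tp / (tp + fp), so both counts follow from the report.
    """
    tp = round(recall * pos_support)
    fn = pos_support - tp
    fp = round(tp / precision) - tp
    tn = neg_support - fp
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_from_report(
    INJ_PRECISION, INJ_RECALL, INJ_SUPPORT, SAFE_SUPPORT
)

# F1 is the harmonic mean of precision and recall, written here in counts.
f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = (tp + tn) / (INJ_SUPPORT + SAFE_SUPPORT)

print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"F1={f1:.4f} accuracy={accuracy:.4f}")
```

Under these assumptions the recomputed values agree with `injection_f1` and `accuracy` above, and with the four-decimal figures in the README's `model-index` block.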
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:7b02116b5148e832610046fd190a831304fdefe7948d05f1892046abd627a59c
  size 598439784
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:b71b71b593eabccf5167ebf9c451f2cb32fcfce8722b0c479ac499408a7b70df
  size 598439784
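
Only the Git LFS pointer for `model.safetensors` changes here: the `oid` differs while the size stays at 598439784 bytes. A downloaded copy of the weights can be checked against the new pointer with a short script. This is a minimal sketch, not part of the commit; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and only the `sha256` oid type shown in the pointer is handled.

```python
import hashlib
from pathlib import Path

# New pointer contents for model.safetensors, copied from this commit.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:b71b71b593eabccf5167ebf9c451f2cb32fcfce8722b0c479ac499408a7b70df
size 598439784
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of an LFS pointer into (sha256_hex, size)."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return digest, int(fields["size"])


def matches_pointer(path, pointer_text, chunk_size=1 << 20):
    """Return True if the file has the pointer's byte size and SHA-256 digest."""
    digest, size = parse_lfs_pointer(pointer_text)
    path = Path(path)
    if path.stat().st_size != size:
        return False  # cheap check first, avoids hashing a wrong-sized file
    sha = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest() == digest


digest, size = parse_lfs_pointer(POINTER_TEXT)
print(digest, size)
```

With a local copy of the weights, `matches_pointer("model.safetensors", POINTER_TEXT)` would report whether the file corresponds to this commit; the path is a placeholder.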
training_config.json CHANGED
@@ -15,7 +15,7 @@
  "max_length": 512,
  "use_class_weights": true,
  "class_weights": [
- 0.9973379969596863,
- 1.002661943435669
  ]
  }
 
  "max_length": 512,
  "use_class_weights": true,
  "class_weights": [
+ 0.9985950589179993,
+ 1.001404881477356
  ]
  }
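
The README lists class weights as "Balanced", and both the old and new values sit very close to 1.0. One common meaning of "balanced" is the heuristic `n_samples / (n_classes * class_count)`. Whether the training script uses exactly this formula is an assumption, since the commit does not include that code. The sketch below illustrates the heuristic with made-up label counts; `balanced_class_weights` is a hypothetical helper and the counts are not taken from the dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumed formula for "balanced" weighting; not confirmed by the commit.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * counts[cls]) for cls in sorted(counts)}


# Hypothetical label counts, chosen only to illustrate a near-balanced split.
labels = [0] * 5010 + [1] * 4990
weights = balanced_class_weights(labels)
print(weights)
```

Under this formula the slightly larger class gets a weight just below 1 and the smaller class a weight just above 1, which is the same pattern as the pair of values in `training_config.json`.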