Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -15,7 +15,7 @@ Fine-tuned from [**jhu-clsp/mmBERT-base**](https://huggingface.co/jhu-clsp/mmBER
|
|
| 15 |
This model outputs two scores: `prompt_injection` (index 0) and `toxic` (index 1). A **tiered detection strategy** combines both heads to achieve higher recall than a single PI threshold alone.
|
| 16 |
|
| 17 |
**Usage:**
|
| 18 |
-
For a single text input, tokenize and split into overlapping chunks of ≤512 tokens (overlap=
|
| 19 |
|
| 20 |
---
|
| 21 |
|
|
@@ -37,8 +37,8 @@ flag = (pi >= pi_thresh) OR (pi >= pi_lower_bound AND toxic >= toxic_thresh)
|
|
| 37 |
|
| 38 |
| Dataset | Recall | FPR |
|
| 39 |
|:--------|-------:|----:|
|
| 40 |
-
| test (262K) | 43.
|
| 41 |
-
| customer_test (1.4M) |
|
| 42 |
|
| 43 |
### Thresholds at 0.5% FPR
|
| 44 |
|
|
@@ -50,8 +50,8 @@ flag = (pi >= pi_thresh) OR (pi >= pi_lower_bound AND toxic >= toxic_thresh)
|
|
| 50 |
|
| 51 |
| Dataset | Recall | FPR |
|
| 52 |
|:--------|-------:|----:|
|
| 53 |
-
| test (262K) | 70.
|
| 54 |
-
| customer_test (1.4M) |
|
| 55 |
|
| 56 |
### Thresholds at 1% FPR
|
| 57 |
|
|
@@ -63,8 +63,8 @@ flag = (pi >= pi_thresh) OR (pi >= pi_lower_bound AND toxic >= toxic_thresh)
|
|
| 63 |
|
| 64 |
| Dataset | Recall | FPR |
|
| 65 |
|:--------|-------:|----:|
|
| 66 |
-
| test (262K) | 75.
|
| 67 |
-
| customer_test (1.4M) |
|
| 68 |
|
| 69 |
### Thresholds for POV
|
| 70 |
|
|
@@ -76,8 +76,8 @@ flag = (pi >= pi_thresh) OR (pi >= pi_lower_bound AND toxic >= toxic_thresh)
|
|
| 76 |
|
| 77 |
| Dataset | Recall | FPR |
|
| 78 |
|:--------|-------:|----:|
|
| 79 |
-
| test (262K) | 97.
|
| 80 |
-
| customer_test (1.4M) | 94.
|
| 81 |
|
| 82 |
---
|
| 83 |
|
|
@@ -85,9 +85,9 @@ flag = (pi >= pi_thresh) OR (pi >= pi_lower_bound AND toxic >= toxic_thresh)
|
|
| 85 |
|
| 86 |
| Test FPR | pi-mmbert-v2 Recall | pi-mmbert-v3.5 Recall | Δ |
|
| 87 |
|:---------|--------------------:|----------------------:|--:|
|
| 88 |
-
| 0.
|
| 89 |
-
| 0.
|
| 90 |
-
|
|
| 91 |
|
| 92 |
> v2 is a single PI-head model; v3.5 adds a toxic head with tiered detection. The v2 thresholds were calibrated on the benign test samples so that v2's FPR matches the v3.5 FPR in each row exactly.
|
| 93 |
|
|
@@ -127,8 +127,8 @@ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
|
|
| 127 |
|
| 128 |
# --- Inference parameters ---
|
| 129 |
max_length = 512
|
| 130 |
-
chunk_overlap =
|
| 131 |
-
stride = max_length - chunk_overlap #
|
| 132 |
|
| 133 |
# --- Tiered thresholds (0.1% FPR — PI-only, no tier rescue) ---
|
| 134 |
# pi_thresh = 0.996
|
|
@@ -142,7 +142,7 @@ toxic_thresh = 0.98
|
|
| 142 |
# pi_thresh = 0.982
|
| 143 |
# pi_lower_bound = 0.5
|
| 144 |
# toxic_thresh = 0.90
|
| 145 |
-
# --- Thresholds for POV (test: recall=97.
|
| 146 |
# pi_thresh = 0.29
|
| 147 |
# pi_lower_bound = 0.10
|
| 148 |
# toxic_thresh = 0.89
|
|
|
|
| 15 |
This model outputs two scores: `prompt_injection` (index 0) and `toxic` (index 1). A **tiered detection strategy** combines both heads to achieve higher recall than a single PI threshold alone.
|
| 16 |
|
| 17 |
**Usage:**
|
| 18 |
+
For a single text input, tokenize and split into overlapping chunks of ≤512 tokens (overlap=100, stride=412), run them in a batch, and take the **maximum logit across chunks** per head before applying sigmoid. Apply the tiered rule to the resulting PI and toxic probabilities.
|
| 19 |
|
| 20 |
---
|
| 21 |
|
|
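The usage note in this hunk (overlapping chunks of at most 512 tokens, overlap=100, stride=412, maximum logit per head across chunks, then sigmoid) can be sketched in plain Python. This is a minimal illustration and not the README's own inference code: `chunk_windows` and `aggregate` are hypothetical helper names, the logits are assumed to arrive as `(prompt_injection, toxic)` pairs per chunk, and special tokens added by the tokenizer are ignored.

```python
import math

def chunk_windows(n_tokens, max_length=512, chunk_overlap=100):
    """Return (start, end) token index pairs for overlapping windows.

    Hypothetical helper: consecutive windows start `stride` tokens apart,
    where stride = max_length - chunk_overlap (412 for the values above).
    """
    stride = max_length - chunk_overlap
    if n_tokens <= max_length:
        return [(0, n_tokens)]  # short input: a single chunk
    windows = []
    start = 0
    while True:
        end = min(start + max_length, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return windows

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def aggregate(chunk_logits):
    """Take the max logit per head across chunks, then apply sigmoid.

    chunk_logits: list of (pi_logit, toxic_logit) pairs, one per chunk
    (index 0 = prompt_injection, index 1 = toxic, as stated above).
    """
    pi_logit = max(pair[0] for pair in chunk_logits)
    toxic_logit = max(pair[1] for pair in chunk_logits)
    return sigmoid(pi_logit), sigmoid(toxic_logit)

# A 1000-token input yields three windows that overlap by 100 tokens.
print(chunk_windows(1000))
```

Taking the maximum before the sigmoid gives the same result as taking it after, since sigmoid is monotonic; the sketch follows the order stated in the usage note.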
|
|
| 37 |
|
| 38 |
| Dataset | Recall | FPR |
|
| 39 |
|:--------|-------:|----:|
|
| 40 |
+
| test (262K) | 43.70% | 0.136% |
|
| 41 |
+
| customer_test (1.4M) | 44.28% | 0.612% |
|
| 42 |
|
| 43 |
### Thresholds at 0.5% FPR
|
| 44 |
|
|
|
|
| 50 |
|
| 51 |
| Dataset | Recall | FPR |
|
| 52 |
|:--------|-------:|----:|
|
| 53 |
+
| test (262K) | 70.32% | 0.608% |
|
| 54 |
+
| customer_test (1.4M) | 70.11% | 2.474% |
|
| 55 |
|
| 56 |
### Thresholds at 1% FPR
|
| 57 |
|
|
|
|
| 63 |
|
| 64 |
| Dataset | Recall | FPR |
|
| 65 |
|:--------|-------:|----:|
|
| 66 |
+
| test (262K) | 75.00% | 0.992% |
|
| 67 |
+
| customer_test (1.4M) | 73.10% | 2.550% |
|
| 68 |
|
| 69 |
### Thresholds for POV
|
| 70 |
|
|
|
|
| 76 |
|
| 77 |
| Dataset | Recall | FPR |
|
| 78 |
|:--------|-------:|----:|
|
| 79 |
+
| test (262K) | 97.33% | 9.281% |
|
| 80 |
+
| customer_test (1.4M) | 94.83% | 6.268% |
|
| 81 |
|
| 82 |
---
|
| 83 |
|
|
|
|
| 85 |
|
| 86 |
| Test FPR | pi-mmbert-v2 Recall | pi-mmbert-v3.5 Recall | Δ |
|
| 87 |
|:---------|--------------------:|----------------------:|--:|
|
| 88 |
+
| 0.136% | 35.31% | **43.70%** | +8.39pp |
|
| 89 |
+
| 0.608% | 60.46% | **70.32%** | +9.86pp |
|
| 90 |
+
| 0.992% | 67.13% | **75.00%** | +7.87pp |
|
| 91 |
|
| 92 |
> v2 is a single PI-head model; v3.5 adds a toxic head with tiered detection. The v2 thresholds were calibrated on the benign test samples so that v2's FPR matches the v3.5 FPR in each row exactly.
|
| 93 |
|
|
|
|
| 127 |
|
| 128 |
# --- Inference parameters ---
|
| 129 |
max_length = 512
|
| 130 |
+
chunk_overlap = 100
|
| 131 |
+
stride = max_length - chunk_overlap # 412
|
| 132 |
|
| 133 |
# --- Tiered thresholds (0.1% FPR — PI-only, no tier rescue) ---
|
| 134 |
# pi_thresh = 0.996
|
|
|
|
| 142 |
# pi_thresh = 0.982
|
| 143 |
# pi_lower_bound = 0.5
|
| 144 |
# toxic_thresh = 0.90
|
| 145 |
+
# --- Thresholds for POV (test: recall=97.33%, FPR=9.281%) ---
|
| 146 |
# pi_thresh = 0.29
|
| 147 |
# pi_lower_bound = 0.10
|
| 148 |
# toxic_thresh = 0.89
|
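The tiered rule quoted in the hunk headers, `flag = (pi >= pi_thresh) OR (pi >= pi_lower_bound AND toxic >= toxic_thresh)`, can be written as a small function. The sketch below is an illustration under assumptions: `tiered_flag` and the `POV` dictionary are hypothetical names, and the threshold values are the ones listed for the POV setting in this diff.

```python
def tiered_flag(pi, toxic, pi_thresh, pi_lower_bound, toxic_thresh):
    """Apply the tiered rule to sigmoid probabilities from the two heads.

    Tier 1: the PI probability alone clears pi_thresh.
    Tier 2: a weaker PI probability (>= pi_lower_bound) is rescued by a
            high toxic probability (>= toxic_thresh).
    """
    return (pi >= pi_thresh) or (pi >= pi_lower_bound and toxic >= toxic_thresh)

# Threshold values copied from the POV block in this diff.
POV = dict(pi_thresh=0.29, pi_lower_bound=0.10, toxic_thresh=0.89)

print(tiered_flag(0.30, 0.00, **POV))  # flagged by the PI head alone
print(tiered_flag(0.15, 0.95, **POV))  # flagged by the toxic-head rescue
print(tiered_flag(0.15, 0.50, **POV))  # not flagged: toxic score too low
print(tiered_flag(0.05, 0.99, **POV))  # not flagged: PI below the lower bound
```

The second tier is what lets a lower `pi_lower_bound` add recall without flagging every input that has a moderate PI score, since it only fires when the toxic head is also confident.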