Edu-p committed on
Commit 6e3e336 · verified
1 Parent(s): 3e04f3b

Update README.md

Files changed (1)
  1. README.md +100 -3
README.md CHANGED
@@ -1,3 +1,100 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ language:
3
+ - pt
4
+ license: mit
5
+ base_model: neuralmind/bert-base-portuguese-cased
6
+ tags:
7
+ - text-classification
8
+ - jailbreak-detection
9
+ - llm-safety
10
+ - red-teaming
11
+ - adversarial-attacks
12
+ - bert
13
+ - portuguese
14
+ pipeline_tag: text-classification
15
+ metrics:
16
+ - f1
17
+ - roc_auc
18
+ ---
19
+
20
+ # SecBERT-PT
21
+
22
+ **SecBERT** is a binary classifier for detecting harmful and jailbreak prompts
23
+ in Brazilian Portuguese. It is built on top of
24
+ [BERTimbau Base](https://huggingface.co/neuralmind/bert-base-portuguese-cased)
25
+ with a fully fine-tuned backbone and a two-layer MLP classification head.
26
+
27
+ This model was introduced in the paper:
28
+
29
+ > **Robustness of Language Models against Portuguese Harmful Prompts**
30
+ > Eduardo Alexandre de Amorim, Cleber Zanchettin
31
+ > *International Joint Conference on Neural Networks (IJCNN)*
32
+ > [[Paper](<link_pub>)] [[Code](https://github.com/Edu-p/secbert-pt)] [[Dataset](https://huggingface.co/datasets/Edu-p/wildjailbreak-pt-br)]
33
+
34
+ ---
35
+
36
+ ## Model Description
37
+
38
+ SecBERT frames harmful prompt detection as a binary classification task.
39
+ Given an input prompt $x$, the model predicts $P(y=1 \mid x)$, where
40
+ $y=1$ indicates a policy-violating (harmful) prompt and $y=0$ indicates
41
+ a benign one.
42
+
43
+ **Architecture:**
44
+
45
+ The [CLS] pooler output $h_{CLS} \in \mathbb{R}^{768}$ from BERTimbau-Base
46
+ is passed through a two-layer MLP:
47
+
48
+ $$z = \text{ReLU}(W_1 h_{CLS} + b_1), \quad W_1 \in \mathbb{R}^{128 \times 768}$$
49
+ $$\hat{y} = \sigma(W_2 z + b_2), \quad W_2 \in \mathbb{R}^{1 \times 128}$$
50
+
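The head defined by the two equations above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' code: the class name `MLPHead` is hypothetical, the weights are randomly initialised, and a random tensor stands in for the BERTimbau pooler output. The actual implementation is the `BertMLPClassifier` in the source repository.

```python
import torch
from torch import nn


class MLPHead(nn.Module):
    """Hypothetical two-layer head (768 -> 128 -> 1) matching the equations above."""

    def __init__(self, hidden_size: int = 768, mlp_size: int = 128) -> None:
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, mlp_size)  # W_1 (128 x 768), b_1
        self.fc2 = nn.Linear(mlp_size, 1)            # W_2 (1 x 128), b_2

    def forward(self, h_cls: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.fc1(h_cls))                # z = ReLU(W_1 h_CLS + b_1)
        return torch.sigmoid(self.fc2(z)).squeeze(-1)  # y_hat = sigma(W_2 z + b_2)


torch.manual_seed(0)
head = MLPHead()
h_cls = torch.randn(4, 768)  # stand-in for a batch of 4 pooler outputs
with torch.no_grad():
    probs = head(h_cls)      # one probability P(y=1 | x) per input
```

Each entry of `probs` lies strictly between 0 and 1 and is compared against a threshold τ to produce the binary decision.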
51
+ **Training:**
52
+
53
+ | Setting | Value |
54
+ |---|---|
55
+ | Base model | neuralmind/bert-base-portuguese-cased |
56
+ | Optimizer | AdamW |
57
+ | Learning rate | 2e-5 |
58
+ | Batch size | 20 |
59
+ | Max sequence length | 512 |
60
+ | LR schedule | Linear warmup (10%) + linear decay |
61
+ | Early stopping patience | 20 (on validation loss) |
62
+
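To make the "linear warmup (10%) + linear decay" row concrete, the sketch below computes the learning rate at a given optimizer step in plain Python. The function name `linear_warmup_decay_lr` and the total of 1000 steps are assumptions for illustration only; the README does not state the number of training steps or which scheduler implementation was used. The `get_linear_schedule_with_warmup` helper in `transformers` produces a schedule of the same shape.

```python
def linear_warmup_decay_lr(step: int, total_steps: int,
                           peak_lr: float = 2e-5,
                           warmup_frac: float = 0.10) -> float:
    """Learning rate at `step`: linear ramp to `peak_lr`, then linear decay to 0."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Warmup phase: scale from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale from the peak down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 1000  # hypothetical; not taken from the README
schedule = [linear_warmup_decay_lr(s, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]
```

Under these assumptions the rate starts at 0, reaches the peak of 2e-5 at step 100, and returns to 0 at step 1000.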
63
+ ---
64
+
65
+ ## Evaluation
66
+
67
+ The model is evaluated on a held-out test set (25% of the
68
+ [wildjailbreak-pt-br](https://huggingface.co/datasets/Edu-p/wildjailbreak-pt-br)
69
+ dataset). Metrics are reported at both the standard threshold (τ=0.5) and the
70
+ KS-optimal threshold (τ*), the score cutoff at which the Kolmogorov-Smirnov (KS) gap between the two classes' cumulative score distributions is largest.
71
+
72
+ | Threshold | Accuracy | Precision | Recall | F1 | FPR |
73
+ |---|---|---|---|---|---|
74
+ | τ = 0.5 | 95.4% | 94.9% | 96.1% | 95.5% | 5.4% |
75
+ | τ* = 0.72 | 95.6% | 96.5% | 94.8% | 95.6% | 3.6% |
76
+
77
+ **Separability (threshold-independent):**
78
+
79
+ | ROC AUC | KS statistic |
80
+ |---|---|
81
+ | 99.2% | 91.2% |
82
+
83
+ The KS statistic measures the maximum separation between the cumulative
84
+ score distributions of benign and harmful classes. A value of 91.2%
85
+ indicates that the model assigns well-separated probability scores to each
86
+ class, making threshold selection robust in deployment.
87
+
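As an illustration of how the KS statistic and the KS-optimal threshold relate, the sketch below computes both from a list of scores and labels in plain Python. The function name `ks_statistic` and the ten toy scores are invented for this example; they are not outputs of the model, and the evaluation code in the source repository may compute the statistic differently.

```python
def ks_statistic(scores, labels):
    """Return (ks, threshold): the largest gap between the empirical score CDFs
    of the benign (0) and harmful (1) classes, and the score where it occurs."""
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pos = [s for s, y in zip(scores, labels) if y == 1]
    best_gap, best_t = 0.0, None
    for t in sorted(set(scores)):
        cdf_neg = sum(s <= t for s in neg) / len(neg)  # benign scores at or below t
        cdf_pos = sum(s <= t for s in pos) / len(pos)  # harmful scores at or below t
        gap = abs(cdf_neg - cdf_pos)
        if gap > best_gap:
            best_gap, best_t = gap, t
    return best_gap, best_t


# Toy data: one benign prompt scores high (0.65) and overlaps the harmful class.
scores = [0.05, 0.10, 0.20, 0.30, 0.65, 0.40, 0.55, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
ks, tau_star = ks_statistic(scores, labels)
```

On this toy data the gap peaks at a cutoff of 0.30, where 80% of benign scores and none of the harmful scores fall at or below it. Flagging prompts whose score exceeds `tau_star` is one natural decision rule; whether the boundary case is treated as harmful is an assumption left open here.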
88
+ ---
89
+
90
+ ## Usage
91
+
92
+ ```python
93
+ import torch
94
+ from transformers import BertTokenizer, BertModel
95
+
96
+ # NOTE: this loads only the tokenizer and backbone. Instantiate the full
97
+ # BertMLPClassifier from the source repo for end-to-end inference.
98
+ # See: https://github.com/Edu-p/secbert-pt
99
+
100
+ tokenizer = BertTokenizer.from_pretrained("Edu-p/secbert-pt")