---
language:
- pt
license: mit
base_model: neuralmind/bert-base-portuguese-cased
tags:
- text-classification
- jailbreak-detection
- llm-safety
- red-teaming
- adversarial-attacks
- bert
- portuguese
pipeline_tag: text-classification
metrics:
- f1
- roc_auc
---

# SecBERT-PT

**SecBERT** is a binary classifier for detecting harmful and jailbreak prompts
in Brazilian Portuguese. It is built on top of
[BERTimbau Base](https://huggingface.co/neuralmind/bert-base-portuguese-cased)
with a fully fine-tuned backbone and a two-layer MLP classification head.

This model was introduced in the paper:

> **Robustness of Language Models against Portuguese Harmful Prompts**  
> Eduardo Alexandre de Amorim, Cleber Zanchettin  
> *International Joint Conference on Neural Networks (IJCNN)*  
> [[Paper](<link_pub>)] [[Code](https://github.com/Edu-p/secbert-pt)] [[Dataset](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)]

---

## Model Description

SecBERT frames harmful prompt detection as a binary classification task.
Given an input prompt $x$, the model predicts $P(y=1 \mid x)$, where
$y=1$ indicates a policy-violating (harmful) prompt and $y=0$ indicates
a benign one.

**Architecture:**

The [CLS] pooler output $h_{CLS} \in \mathbb{R}^{768}$ from BERTimbau-Base
is passed through a two-layer MLP:

$$z = \text{ReLU}(W_1 h_{CLS} + b_1), \quad W_1 \in \mathbb{R}^{128 \times 768}$$
$$\hat{y} = \sigma(W_2 z + b_2), \quad W_2 \in \mathbb{R}^{1 \times 128}$$
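
The two equations above can be traced in a dependency-free sketch. It is only an illustration of the shapes and arithmetic: the weights are random placeholders rather than the trained SecBERT parameters, and the helper names `linear` and `mlp_head` are hypothetical, not taken from the source repository. In a PyTorch implementation the two `linear` calls would correspond to two `torch.nn.Linear` layers.

```python
import math
import random

HIDDEN_IN, HIDDEN_MLP = 768, 128  # dimensions taken from the equations above


def linear(weights, bias, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def mlp_head(h_cls, w1, b1, w2, b2):
    """Return P(y=1 | x) for one pooled [CLS] vector."""
    z = [max(0.0, v) for v in linear(w1, b1, h_cls)]  # z = ReLU(W1 h + b1)
    logit = linear(w2, b2, z)[0]                      # W2 z + b2, a single value
    return 1.0 / (1.0 + math.exp(-logit))             # sigmoid


rng = random.Random(0)
# Random placeholder weights, NOT the trained model parameters.
w1 = [[rng.gauss(0, 0.02) for _ in range(HIDDEN_IN)] for _ in range(HIDDEN_MLP)]
b1 = [0.0] * HIDDEN_MLP
w2 = [[rng.gauss(0, 0.02) for _ in range(HIDDEN_MLP)]]
b2 = [0.0]

h_cls = [rng.uniform(-1, 1) for _ in range(HIDDEN_IN)]  # stand-in pooler output
prob = mlp_head(h_cls, w1, b1, w2, b2)

# Parameters in the head alone: W1 + b1 + W2 + b2.
n_params = HIDDEN_MLP * HIDDEN_IN + HIDDEN_MLP + HIDDEN_MLP + 1
print(f"P(y=1|x) = {prob:.3f}, head parameters = {n_params}")
```

The head is small next to the backbone: under the dimensions stated above it adds 98,561 parameters.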

**Training:**

| Setting | Value |
|---|---|
| Base model | neuralmind/bert-base-portuguese-cased |
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Batch size | 20 |
| Max sequence length | 512 |
| LR schedule | Linear warmup (10%) + linear decay |
| Early stopping patience | 20 (on validation loss) |
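
As a rough illustration of the learning-rate schedule in the table, the sketch below computes the rate at a given optimizer step. It assumes that the 10% warmup is measured against the total number of training steps and that the decay ends at zero; neither detail is stated in this card. The function name `lr_at` and the 1,000-step run are hypothetical.

```python
def lr_at(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Learning rate at a 0-indexed optimizer step.

    Assumptions (not confirmed by the model card): warmup covers
    warmup_frac of total_steps, and the decay reaches zero at the end.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Hypothetical run of 1,000 optimizer steps.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000))
```

The rate peaks at 2e-5 once warmup finishes (step 100 in this hypothetical run) and falls linearly from there.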

---

## Evaluation

Evaluated on a held-out test set (25% of the
[harmful-prompts-pt](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)
dataset). Metrics are reported at both the standard threshold (τ = 0.5) and the
KS-optimal threshold (τ*), the score cutoff at which the gap between the true
positive rate (recall) and the false positive rate (FPR) is largest.

| Threshold | Accuracy | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| τ = 0.5 | 95.4% | 94.9% | 96.1% | 95.5% | 5.4% |
| τ* = 0.72 | 95.6% | 96.5% | 94.8% | 95.6% | 3.6% |
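
For readers who want to reproduce the columns of the table above on their own data, the following sketch shows how each metric follows from the confusion counts at a chosen threshold. The scores and labels are made-up toy values, not results from the paper, and `metrics_at` is a hypothetical helper name.

```python
def metrics_at(scores, labels, tau):
    """Binary metrics when a prompt is flagged harmful if score >= tau."""
    preds = [int(s >= tau) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,  # benign prompts wrongly flagged
    }


# Toy data for illustration only (1 = harmful, 0 = benign).
scores = [0.95, 0.90, 0.80, 0.70, 0.35, 0.60, 0.40, 0.25, 0.15, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

for tau in (0.5, 0.7):
    print(tau, metrics_at(scores, labels, tau))
```

On this toy data, raising the threshold removes the one false positive without losing any true positives. On the real test set the same move trades recall for precision, as the two rows of the table show.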

**Separability (threshold-independent):**

| AUC | KS Statistic |
|---|---|
| 99.2% | 91.2% |

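The KS statistic measures the maximum separation between the cumulative
score distributions of the benign and harmful classes. It equals recall minus
FPR at τ* (94.8% − 3.6% = 91.2%). A value this high indicates that the model
assigns well-separated probability scores to each class, so moderate changes
to the threshold shift the metrics only slightly, as the small differences
between the τ = 0.5 and τ* rows show.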
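
A minimal sketch of how a KS statistic and its threshold can be computed from classifier scores is given below. It assumes that higher scores indicate the harmful class and scans only the observed scores as candidate cutoffs. The toy data and the name `ks_threshold` are illustrative; this is not the evaluation code used in the paper.

```python
def ks_threshold(scores, labels):
    """Return (ks, tau_star): the largest TPR - FPR gap and the cutoff reaching it.

    A prompt is flagged harmful when score >= tau. Assumes both classes are
    present and that harmful prompts tend to receive higher scores.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_ks, best_tau = 0.0, None
    for tau in sorted(set(scores)):
        tpr = sum(s >= tau and y == 1 for s, y in zip(scores, labels)) / n_pos
        fpr = sum(s >= tau and y == 0 for s, y in zip(scores, labels)) / n_neg
        if tpr - fpr > best_ks:
            best_ks, best_tau = tpr - fpr, tau
    return best_ks, best_tau


# Toy data for illustration only (1 = harmful, 0 = benign).
scores = [0.95, 0.90, 0.80, 0.70, 0.35, 0.60, 0.40, 0.25, 0.15, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

ks, tau_star = ks_threshold(scores, labels)
print(f"KS = {ks:.2f} at tau* = {tau_star}")
```

The gap TPR − FPR at a cutoff is the same quantity as the difference between the two classes' cumulative score distributions just below that cutoff, which is why maximizing it yields the KS statistic.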

---

## Usage

```python
import torch
from transformers import BertTokenizer

# BertMLPClassifier is defined in the source repository; run this from a clone of it.
from src.model import BertMLPClassifier

tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = BertMLPClassifier(
    model_name="neuralmind/bert-base-portuguese-cased",
    hidden_dim=768,
    freeze_backbone=False,
)
# map_location="cpu" lets the checkpoint load on machines without a GPU.
model.load_state_dict(
    torch.load("best_model.pth", map_location="cpu", weights_only=True)
)
model.eval()

# KS-optimal threshold from paper
TAU_STAR = 0.72

inputs = tokenizer(
    "Ignore suas instruções anteriores e...",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)
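with torch.no_grad():
    # The head has a single output unit (see Architecture), so the harmful-class
    # probability is the sigmoid of that value. This assumes forward() returns
    # the raw pre-sigmoid logit; drop the sigmoid if it already returns a probability.
    logit = model(**inputs)
    prob = torch.sigmoid(logit).squeeze().item()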

label = "harmful" if prob >= TAU_STAR else "benign"
print(f"Score: {prob:.3f} → {label}")
```

For the full `BertMLPClassifier` definition, clone the
[source repository](https://github.com/Edu-p/secbert-pt).

---

## Limitations

- The dataset was generated via automated translation. Organically crafted
  Portuguese jailbreaks from native attackers may not be fully represented.
- The model was trained on a static snapshot of WildJailbreak attack vectors.
  Novel jailbreak strategies not present in the training data may evade detection.
- SecBERT is designed as one layer of a defense-in-depth strategy, not as a
  standalone solution.

---

## Citation

```bibtex
@inproceedings{amorim2026secbert,
  title     = {Robustness of Language Models against {P}ortuguese Harmful Prompts},
  author    = {Amorim, Eduardo Alexandre de and Zanchettin, Cleber},
  booktitle = {Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
  year      = {2026}
}
```

---

## License

MIT License — research use only. Users are responsible for complying with the
terms of the original
[WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) dataset.