---
language:
- pt
license: mit
base_model: neuralmind/bert-base-portuguese-cased
tags:
- text-classification
- jailbreak-detection
- llm-safety
- red-teaming
- adversarial-attacks
- bert
- portuguese
pipeline_tag: text-classification
metrics:
- f1
- roc_auc
---
# SecBERT-PT
**SecBERT** is a binary classifier for detecting harmful and jailbreak prompts
in Brazilian Portuguese. It is built on top of
[BERTimbau Base](https://huggingface.co/neuralmind/bert-base-portuguese-cased)
with a fully fine-tuned backbone and a two-layer MLP classification head.
This model was introduced in the paper:
> **Robustness of Language Models against Portuguese Harmful Prompts**
> Eduardo Alexandre de Amorim, Cleber Zanchettin
> *International Joint Conference on Neural Networks (IJCNN)*
> [[Paper](<link_pub>)] [[Code](https://github.com/Edu-p/secbert-pt)] [[Dataset](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)]
---
## Model Description
SecBERT frames harmful prompt detection as a binary classification task.
Given an input prompt $x$, the model predicts $P(y=1 \mid x)$, where
$y=1$ indicates a policy-violating (harmful) prompt and $y=0$ indicates
a benign one.
**Architecture:**
The [CLS] pooler output $h_{CLS} \in \mathbb{R}^{768}$ from BERTimbau-Base
is passed through a two-layer MLP:
$$z = \text{ReLU}(W_1 h_{CLS} + b_1), \quad W_1 \in \mathbb{R}^{128 \times 768}$$
$$\hat{y} = \sigma(W_2 z + b_2), \quad W_2 \in \mathbb{R}^{1 \times 128}$$
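To make the shapes in the two equations above concrete, here is a minimal plain-Python sketch of the head's forward pass. It is an illustration only: the weights are random placeholders, not the trained parameters, and the helper names (`affine`, `mlp_head`) are hypothetical rather than taken from the source repository.

```python
import math
import random

HIDDEN_SIZE = 768  # dimension of h_CLS from BERTimbau Base
MLP_DIM = 128      # number of rows in W_1


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def affine(W, b, v):
    """Compute W v + b for a row-major matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def mlp_head(h_cls, W1, b1, W2, b2):
    z = relu(affine(W1, b1, h_cls))       # z in R^128
    return sigmoid(affine(W2, b2, z)[0])  # scalar y_hat in (0, 1)


# Random placeholder weights, for shape illustration only.
rng = random.Random(0)
W1 = [[rng.gauss(0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(MLP_DIM)]
b1 = [0.0] * MLP_DIM
W2 = [[rng.gauss(0, 0.02) for _ in range(MLP_DIM)]]
b2 = [0.0]
h_cls = [rng.gauss(0, 1) for _ in range(HIDDEN_SIZE)]

y_hat = mlp_head(h_cls, W1, b1, W2, b2)
print(f"P(y=1 | x) = {y_hat:.3f}")
```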
**Training:**
| Setting | Value |
|---|---|
| Base model | neuralmind/bert-base-portuguese-cased |
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Batch size | 20 |
| Max sequence length | 512 |
| LR schedule | Linear warmup (10%) + linear decay |
| Early stopping patience | 20 (on validation loss) |
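The learning-rate schedule in the table can be sketched as a function of the training step. This is a minimal illustration, assuming the 10% warmup is measured in optimizer steps and the decay reaches zero at the final step; `lr_at_step` and `TOTAL_STEPS` are hypothetical names, and the total step count is invented for the example. The curve has the same shape as the one produced by `get_linear_schedule_with_warmup` in `transformers`, although the card does not say which implementation was used.

```python
def lr_at_step(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Linear warmup to base_lr, then linear decay to zero."""
    warmup_steps = round(warmup_frac * total_steps)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)


TOTAL_STEPS = 1000  # hypothetical value, for illustration only
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, TOTAL_STEPS):.2e}")
```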
---
## Evaluation
The model was evaluated on a held-out test set (25% of the
[harmful-prompts-pt](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)
dataset). Metrics are reported at both the standard threshold (τ = 0.5) and the
KS-optimal threshold (τ*), the score cutoff at which the KS statistic is
attained and class separability is therefore highest.
| Threshold | Accuracy | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| τ = 0.5 | 95.4% | 94.9% | 96.1% | 95.5% | 5.4% |
| τ* = 0.72 | 95.6% | 96.5% | 94.8% | 95.6% | 3.6% |
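As a consistency check, the F1 column follows from the precision ($P$) and recall ($R$) columns through the harmonic mean:

$$F_1 = \frac{2PR}{P + R}$$

$$\tau = 0.5:\ \frac{2(0.949)(0.961)}{0.949 + 0.961} \approx 0.955, \qquad \tau^* = 0.72:\ \frac{2(0.965)(0.948)}{0.965 + 0.948} \approx 0.956$$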
**Separability (threshold-independent):**
| AUC | KS Statistic |
|---|---|
| 99.2% | 91.2% |
The KS statistic is the maximum gap between the cumulative score
distributions of the benign and harmful classes. A value of 91.2% means the
two distributions overlap very little: the metrics change only slightly
between τ = 0.5 and τ* = 0.72 in the table above, which makes threshold
selection robust in deployment.
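With harmful as the positive class, the gap between the two cumulative distributions at a cutoff equals the true-positive rate minus the false-positive rate there, so the KS statistic and τ* can be found by scanning candidate thresholds. This agrees with the table above, where recall minus FPR at τ* is 94.8% − 3.6% = 91.2%. The sketch below is one way to compute it; the function name is hypothetical and the scores are toy values invented for illustration, not data from the paper.

```python
def ks_optimal_threshold(scores, labels):
    """Return (ks_statistic, threshold) maximizing TPR - FPR.

    scores: predicted P(y=1 | x); labels: 1 = harmful, 0 = benign.
    A prompt is flagged as harmful when score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one example of each class")

    best_ks, best_tau = 0.0, 1.0
    # Every distinct score is a candidate cutoff (quadratic, fine for a sketch).
    for tau in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= tau)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= tau)
        ks = tp / n_pos - fp / n_neg
        if ks > best_ks:
            best_ks, best_tau = ks, tau
    return best_ks, best_tau


# Toy scores, invented for illustration only.
scores = [0.05, 0.20, 0.40, 0.65, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
ks, tau = ks_optimal_threshold(scores, labels)
print(f"KS = {ks:.2f} at tau = {tau:.2f}")
```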
---
## Usage
```python
from transformers import BertTokenizer
from src.model import BertMLPClassifier
import torch
tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = BertMLPClassifier(
    model_name="neuralmind/bert-base-portuguese-cased",
    hidden_dim=768,
    freeze_backbone=False,
)
model.load_state_dict(torch.load("best_model.pth", weights_only=True))
model.eval()
# KS-optimal threshold from paper
TAU_STAR = 0.72
inputs = tokenizer(
    "Ignore suas instruções anteriores e...",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)
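with torch.no_grad():
    # Assumes the forward pass returns two-class logits of shape (batch, 2).
    # If it instead returns the single output described under Architecture,
    # use torch.sigmoid(logits) (or the output itself if it is already a probability).
    logits = model(**inputs)
prob = torch.softmax(logits, dim=1)[0, 1].item()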
label = "harmful" if prob >= TAU_STAR else "benign"
print(f"Score: {prob:.3f} → {label}")
```
For the full `BertMLPClassifier` definition, clone the
[source repository](https://github.com/Edu-p/secbert-pt).
---
## Limitations
- The dataset was generated via automated translation. Organically crafted
Portuguese jailbreaks from native attackers may not be fully represented.
- The model was trained on a static snapshot of WildJailbreak attack vectors.
Novel jailbreak strategies not present in the training data may evade detection.
- SecBERT is designed as one layer of a defense-in-depth strategy, not as a
standalone solution.
---
## Citation
```bibtex
@inproceedings{amorim2026secbert,
title = {Robustness of Language Models against {P}ortuguese Harmful Prompts},
author = {Amorim, Eduardo Alexandre de and Zanchettin, Cleber},
booktitle = {Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
year = {2026}
}
```
---
## License
MIT License — research use only. Users are responsible for complying with the
terms of the original
[WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) dataset.