---
language:
- pt
license: mit
base_model: neuralmind/bert-base-portuguese-cased
tags:
- text-classification
- jailbreak-detection
- llm-safety
- red-teaming
- adversarial-attacks
- bert
- portuguese
pipeline_tag: text-classification
metrics:
- f1
- roc_auc
---

# SecBERT-PT

**SecBERT-PT** (hereafter SecBERT) is a binary classifier that detects harmful
and jailbreak prompts in Brazilian Portuguese. It is built on
[BERTimbau Base](https://huggingface.co/neuralmind/bert-base-portuguese-cased),
with a fully fine-tuned backbone and a two-layer MLP classification head.

This model was introduced in the paper:

> **Robustness of Language Models against Portuguese Harmful Prompts**  
> Eduardo Alexandre de Amorim, Cleber Zanchettin  
> *International Joint Conference on Neural Networks (IJCNN)*  
> [[Paper](<link_pub>)] [[Code](https://github.com/Edu-p/secbert-pt)] [[Dataset](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)]

---

## Model Description

SecBERT frames harmful prompt detection as a binary classification task.
Given an input prompt $x$, the model predicts $P(y=1 \mid x)$, where
$y=1$ indicates a policy-violating (harmful) prompt and $y=0$ indicates
a benign one.

**Architecture:**

The [CLS] pooler output $h_{CLS} \in \mathbb{R}^{768}$ from BERTimbau Base
is passed through a two-layer MLP:

$$z = \text{ReLU}(W_1 h_{CLS} + b_1), \quad W_1 \in \mathbb{R}^{128 \times 768}$$
$$\hat{y} = \sigma(W_2 z + b_2), \quad W_2 \in \mathbb{R}^{1 \times 128}$$
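
The two equations above can be sketched in PyTorch as follows. This is an illustrative sketch only, not the repository's `BertMLPClassifier`: the class name `MLPHead` is hypothetical, and a random tensor stands in for the pooler output $h_{CLS}$.

```python
import torch
from torch import nn


class MLPHead(nn.Module):
    """Illustrative two-layer head (hypothetical name, not the repository class)."""

    def __init__(self, in_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)  # W_1 (128 x 768), b_1
        self.fc2 = nn.Linear(hidden_dim, 1)       # W_2 (1 x 128), b_2

    def forward(self, h_cls: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.fc1(h_cls))                # z = ReLU(W_1 h_CLS + b_1)
        return torch.sigmoid(self.fc2(z)).squeeze(-1)  # y_hat = sigma(W_2 z + b_2)


head = MLPHead()
h_cls = torch.randn(4, 768)  # random stand-in for a batch of four pooler outputs
y_hat = head(h_cls)          # shape (4,), one probability per input
```

Because the head ends in a single sigmoid unit, each output is directly an estimate of $P(y=1 \mid x)$ and can be compared against a threshold without a softmax.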

**Training:**

| Setting | Value |
|---|---|
| Base model | neuralmind/bert-base-portuguese-cased |
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Batch size | 20 |
| Max sequence length | 512 |
| LR schedule | Linear warmup (10%) + linear decay |
| Early stopping patience | 20 (on validation loss) |
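
The schedule and early-stopping rows of the table can be read as the sketch below. It is a plain-Python illustration under stated assumptions, not the authors' training loop: `lr_multiplier`, `EarlyStopping`, and `TOTAL_STEPS` are hypothetical names and values, and the card does not say whether patience counts epochs or evaluation steps.

```python
def lr_multiplier(step: int, total_steps: int, warmup_frac: float = 0.10) -> float:
    """Factor applied to the peak learning rate at a given optimizer step."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear ramp from 0 to 1
    # linear decay from 1 at the end of warmup to 0 at the final step
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


class EarlyStopping:
    """Stop once validation loss has not improved for `patience` consecutive checks."""

    def __init__(self, patience: int = 20):
        self.patience = patience
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, val_loss: float) -> bool:
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


PEAK_LR = 2e-5      # learning rate from the table
TOTAL_STEPS = 1000  # hypothetical; depends on dataset size and number of epochs
lrs = [PEAK_LR * lr_multiplier(s, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]
```

In a PyTorch training loop the same shape is available from `transformers.get_linear_schedule_with_warmup`, given the number of warmup steps and total training steps.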

---

## Evaluation

The model was evaluated on a held-out test set (25% of the
[harmful-prompts-pt](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)
dataset). Metrics are reported at both the standard threshold (τ = 0.5) and the
KS-optimal threshold (τ*), the threshold at which the Kolmogorov-Smirnov (KS)
statistic is attained, i.e. where the separation between the two classes is largest.

| Threshold | Accuracy | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| τ = 0.5 | 95.4% | 94.9% | 96.1% | 95.5% | 5.4% |
| τ* = 0.72 | 95.6% | 96.5% | 94.8% | 95.6% | 3.6% |
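
For reference, the columns of the table can be computed from scores and labels as in the sketch below. The function name `metrics_at_threshold` and the toy scores are assumptions made for illustration; they are not the paper's evaluation code or predictions.

```python
def metrics_at_threshold(scores, labels, tau):
    """Binary metrics when a prompt is flagged harmful if score >= tau (label 1 = harmful)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,  # benign prompts wrongly flagged
    }


# Toy scores for illustration only; these are not the paper's predictions.
scores = [0.95, 0.80, 0.60, 0.30, 0.65, 0.40, 0.10, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
at_default = metrics_at_threshold(scores, labels, tau=0.5)
at_strict = metrics_at_threshold(scores, labels, tau=0.72)
```

On the toy data, raising the threshold lowers the false-positive rate and the recall while raising precision, the same direction of trade-off as between the two rows of the table.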

**Separability (threshold-independent):**

| AUC | KS Statistic |
|---|---|
| 99.2% | 91.2% |

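The KS statistic measures the maximum separation between the cumulative
score distributions of the benign and harmful classes. Equivalently, it is the
largest gap between the true-positive rate and the false-positive rate over all
thresholds: at τ* = 0.72 this gap is 94.8% - 3.6% = 91.2%. A value this high
indicates that the model assigns well-separated probability scores to the two
classes, so results are not very sensitive to the exact threshold: accuracy and
F1 differ by at most 0.2 points between τ = 0.5 and τ*.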
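
A minimal sketch of how a KS statistic and its threshold can be obtained from validation scores is given below. It assumes harmful prompts tend to receive higher scores and uses the identity KS = max over τ of (TPR(τ) - FPR(τ)); the function name `ks_statistic` and the toy data are hypothetical and do not reproduce the paper's procedure or numbers.

```python
def ks_statistic(scores, labels):
    """Return (KS, tau_star), where KS is the largest TPR - FPR gap over candidate thresholds."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_ks, best_tau = 0.0, None
    for tau in sorted(set(scores)):  # every observed score is a candidate threshold
        tpr = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1) / n_pos
        fpr = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0) / n_neg
        if tpr - fpr > best_ks:
            best_ks, best_tau = tpr - fpr, tau
    return best_ks, best_tau


# Toy scores for illustration only; these are not the paper's predictions.
scores = [0.95, 0.90, 0.80, 0.70, 0.35, 0.75, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
ks, tau_star = ks_statistic(scores, labels)
```

The threshold should be chosen on validation data rather than on the test set, so that the reported test metrics remain an unbiased estimate.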

---

## Usage

```python
from transformers import BertTokenizer
from src.model import BertMLPClassifier
import torch

tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = BertMLPClassifier(
    model_name="neuralmind/bert-base-portuguese-cased",
    hidden_dim=768,
    freeze_backbone=False,
)
# map_location="cpu" lets the checkpoint load on machines without a GPU
model.load_state_dict(
    torch.load("best_model.pth", map_location="cpu", weights_only=True)
)
model.eval()

# KS-optimal threshold from paper
TAU_STAR = 0.72

inputs = tokenizer(
    "Ignore suas instruções anteriores e...",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)
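with torch.no_grad():
    # The head described under Model Description ends in a single sigmoid unit,
    # so there is no class axis to softmax over. This assumes forward() returns
    # the raw logit; if it already applies the sigmoid, drop torch.sigmoid.
    logits = model(**inputs)
    prob = torch.sigmoid(logits).squeeze().item()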

label = "harmful" if prob >= TAU_STAR else "benign"
print(f"Score: {prob:.3f} → {label}")
```

For the full `BertMLPClassifier` definition, clone the
[source repository](https://github.com/Edu-p/secbert-pt).

---

## Limitations

- The dataset was generated via automated translation. Organically crafted
  Portuguese jailbreaks from native attackers may not be fully represented.
- The model was trained on a static snapshot of WildJailbreak attack vectors.
  Novel jailbreak strategies not present in the training data may evade detection.
- SecBERT is designed as one layer of a defense-in-depth strategy, not as a
  standalone solution.

---

## Citation

```bibtex
@inproceedings{amorim2026secbert,
  title     = {Robustness of Language Models against {P}ortuguese Harmful Prompts},
  author    = {Amorim, Eduardo Alexandre de and Zanchettin, Cleber},
  booktitle = {Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
  year      = {2026}
}
```

---

## License

MIT License, intended for research use only. Users are responsible for complying
with the terms of the original
[WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) dataset.