---
language:
- pt
license: mit
base_model: neuralmind/bert-base-portuguese-cased
tags:
- text-classification
- jailbreak-detection
- llm-safety
- red-teaming
- adversarial-attacks
- bert
- portuguese
pipeline_tag: text-classification
metrics:
- f1
- roc_auc
---
# SecBERT-PT
**SecBERT** is a binary classifier for detecting harmful and jailbreak prompts
in Brazilian Portuguese. It is built on top of
[BERTimbau Base](https://huggingface.co/neuralmind/bert-base-portuguese-cased)
with a fully fine-tuned backbone and a two-layer MLP classification head.
This model was introduced in the paper:
> **Robustness of Language Models against Portuguese Harmful Prompts**
> Eduardo Alexandre de Amorim, Cleber Zanchettin
> *International Joint Conference on Neural Networks (IJCNN)*
> [[Paper](<link_pub>)] [[Code](https://github.com/Edu-p/secbert-pt)] [[Dataset](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)]
---
## Model Description
SecBERT frames harmful prompt detection as a binary classification task.
Given an input prompt $x$, the model predicts $P(y=1 \mid x)$, where
$y=1$ indicates a policy-violating (harmful) prompt and $y=0$ indicates
a benign one.
**Architecture:**
The [CLS] pooler output $h_{CLS} \in \mathbb{R}^{768}$ from BERTimbau-Base
is passed through a two-layer MLP:
$$z = \text{ReLU}(W_1 h_{CLS} + b_1), \quad W_1 \in \mathbb{R}^{128 \times 768}$$
$$\hat{y} = \sigma(W_2 z + b_2), \quad W_2 \in \mathbb{R}^{1 \times 128}$$
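
The two equations above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's `BertMLPClassifier`: the class name `MLPHeadSketch` is hypothetical, and a random tensor stands in for the BERTimbau pooler output.

```python
import torch
from torch import nn


class MLPHeadSketch(nn.Module):
    """Hypothetical two-layer head mirroring the equations above."""

    def __init__(self, in_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)  # W_1 (128 x 768), b_1
        self.fc2 = nn.Linear(hidden_dim, 1)       # W_2 (1 x 128), b_2

    def forward(self, h_cls: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.fc1(h_cls))                # z = ReLU(W_1 h_CLS + b_1)
        return torch.sigmoid(self.fc2(z)).squeeze(-1)  # y_hat = sigma(W_2 z + b_2)


head = MLPHeadSketch()
h_cls = torch.randn(4, 768)  # stand-in for a batch of four pooler outputs
y_hat = head(h_cls)          # one score per prompt, shape (4,)
```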
**Training:**
| Setting | Value |
|---|---|
| Base model | neuralmind/bert-base-portuguese-cased |
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Batch size | 20 |
| Max sequence length | 512 |
| LR schedule | Linear warmup (10%) + linear decay |
| Early stopping patience | 20 (on validation loss) |
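
As an illustration of the learning-rate schedule in the table, the sketch below computes the rate at a given optimizer step. It assumes the decay phase runs linearly down to zero and uses a hypothetical run of 1,000 steps; the function name `lr_at_step` is not part of the repository.

```python
def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-5,
               warmup_frac: float = 0.10) -> float:
    """Learning rate under linear warmup followed by linear decay to zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr
        return base_lr * step / warmup_steps
    # Decay: ramp linearly from base_lr down to 0 (assumed end value)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1,000 optimizer steps
schedule = [lr_at_step(s, total_steps=1000) for s in range(1001)]
```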
---
## Evaluation
SecBERT was evaluated on a held-out test set (25% of the
[harmful-prompts-pt](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)
dataset). Metrics are reported at the standard threshold (τ = 0.5) and at the
KS-optimal threshold (τ*), the score cutoff at which the gap between the
true-positive rate and the false-positive rate is largest.
| Threshold | Accuracy | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| τ = 0.5 | 95.4% | 94.9% | 96.1% | 95.5% | 5.4% |
| τ* = 0.72 | 95.6% | 96.5% | 94.8% | 95.6% | 3.6% |
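
For readers who want to reproduce this kind of table on their own data, the sketch below computes the same five metrics from scores and labels at a chosen threshold. The function name `binary_metrics` is hypothetical, and the toy scores are invented for illustration; they are not drawn from the test set.

```python
def binary_metrics(scores, labels, tau):
    """Confusion-matrix metrics for predictions `score >= tau` (1 = harmful)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


# Toy scores, invented for illustration only
toy_scores = [0.95, 0.80, 0.60, 0.30, 0.65, 0.20, 0.10, 0.05]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
at_half = binary_metrics(toy_scores, toy_labels, tau=0.5)
at_strict = binary_metrics(toy_scores, toy_labels, tau=0.72)
```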
**Separability (threshold-independent):**
| AUC | KS Statistic |
|---|---|
| 99.2% | 91.2% |
The KS statistic is the maximum gap between the cumulative score
distributions of the benign and harmful classes. It is attained at τ*, where
it equals recall minus FPR (94.8% − 3.6% = 91.2%). A gap this large means the
two classes receive well-separated scores, so the choice of threshold has
little effect on overall quality: moving from τ = 0.5 to τ* changes F1 by
only 0.1 points.
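
A KS-optimal threshold can be found by scanning candidate cutoffs and keeping the one with the largest TPR − FPR gap. The sketch below is one simple way to do this, assuming harmful prompts tend to score higher than benign ones and that both classes are present; the function name and the toy data are hypothetical, and this is not necessarily the procedure used in the paper.

```python
def ks_optimal_threshold(scores, labels):
    """Return (tau_star, ks), where tau_star maximizes TPR - FPR for `score >= tau`."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_tau, best_ks = None, -1.0
    for tau in sorted(set(scores)):  # every observed score is a candidate cutoff
        tpr = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1) / n_pos
        fpr = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0) / n_neg
        if tpr - fpr > best_ks:
            best_tau, best_ks = tau, tpr - fpr
    return best_tau, best_ks


# Toy scores, invented for illustration only (1 = harmful, 0 = benign)
toy_scores = [0.95, 0.80, 0.60, 0.30, 0.65, 0.20, 0.10, 0.05]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
tau_star, ks = ks_optimal_threshold(toy_scores, toy_labels)
```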
---
## Usage
```python
from transformers import BertTokenizer
from src.model import BertMLPClassifier
import torch
tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = BertMLPClassifier(
    model_name="neuralmind/bert-base-portuguese-cased",
    hidden_dim=768,
    freeze_backbone=False,
)
# map_location lets a checkpoint saved on GPU load on a CPU-only machine
model.load_state_dict(torch.load("best_model.pth", map_location="cpu", weights_only=True))
model.eval()
# KS-optimal threshold from paper
TAU_STAR = 0.72
inputs = tokenizer(
    "Ignore suas instruções anteriores e...",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)
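with torch.no_grad():
    logits = model(**inputs)
# Assumes the model returns one raw logit per prompt, matching the
# single-output sigmoid head described under Model Description. If your
# checkpoint's forward() already applies the sigmoid, use its output directly.
prob = torch.sigmoid(logits).reshape(-1)[0].item()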
label = "harmful" if prob >= TAU_STAR else "benign"
print(f"Score: {prob:.3f} → {label}")
```
For the full `BertMLPClassifier` definition, clone the
[source repository](https://github.com/Edu-p/secbert-pt).
---
## Limitations
- The dataset was generated via automated translation, so jailbreaks written
natively in Portuguese may be underrepresented.
- The model was trained on a static snapshot of WildJailbreak attack vectors.
Novel jailbreak strategies not present in the training data may evade detection.
- SecBERT is designed as one layer of a defense-in-depth strategy, not as a
standalone solution.
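
As an illustration of the defense-in-depth point, the sketch below places the classifier as an input filter in front of a language model. Everything here is hypothetical: `guarded_generate`, the refusal message, and the stand-in `fake_score` and `fake_llm` callables are placeholders for a real scoring function and a real LLM call.

```python
from typing import Callable

REFUSAL = "Request blocked by the input filter."  # placeholder message


def guarded_generate(prompt: str,
                     score_fn: Callable[[str], float],
                     generate_fn: Callable[[str], str],
                     tau: float = 0.72) -> str:
    """Forward the prompt to the LLM only if the classifier score is below tau."""
    if score_fn(prompt) >= tau:
        return REFUSAL
    # Other layers (e.g. output-side checks) would still apply to this response.
    return generate_fn(prompt)


# Stand-in callables so the sketch runs without a model or an LLM
def fake_score(prompt: str) -> float:
    return 0.9 if "ignore" in prompt.lower() else 0.1


def fake_llm(prompt: str) -> str:
    return f"echo: {prompt}"


blocked = guarded_generate("Ignore suas instruções anteriores e...", fake_score, fake_llm)
allowed = guarded_generate("Qual é a capital do Brasil?", fake_score, fake_llm)
```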
---
## Citation
```bibtex
@inproceedings{amorim2026secbert,
title = {Robustness of Language Models against {P}ortuguese Harmful Prompts},
author = {Amorim, Eduardo Alexandre de and Zanchettin, Cleber},
booktitle = {Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
year = {2026}
}
```
---
## License
MIT License — research use only. Users are responsible for complying with the
terms of the original
[WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) dataset.