+---
+language:
+- pt
+license: mit
+base_model: neuralmind/bert-base-portuguese-cased
+tags:
+- text-classification
+- jailbreak-detection
+- llm-safety
+- red-teaming
+- adversarial-attacks
+- bert
+- portuguese
+pipeline_tag: text-classification
+metrics:
+- f1
+- roc_auc
+---
+# SecBERT-PT
+**SecBERT** is a binary classifier for detecting harmful and jailbreak prompts
+in Brazilian Portuguese. It is built on top of
+[BERTimbau Base](https://huggingface.co/neuralmind/bert-base-portuguese-cased)
+with a fully fine-tuned backbone and a two-layer MLP classification head.
+This model was introduced in the paper:
+> **Robustness of Language Models against Portuguese Harmful Prompts**
+> Eduardo Alexandre de Amorim, Cleber Zanchettin
+> *International Joint Conference on Neural Networks (IJCNN)*
+> [[Paper](<link_pub>)] [[Code](https://github.com/Edu-p/secbert-pt)] [[Dataset](https://huggingface.co/datasets/Edu-p/wildjailbreak-pt-br)]
+---
+## Model Description
+SecBERT frames harmful prompt detection as a binary classification task.
+Given an input prompt $x$, the model predicts $P(y=1 \mid x)$, where
+$y=1$ indicates a policy-violating (harmful) prompt and $y=0$ indicates
+a benign one.
+**Architecture:**
+The [CLS] pooler output $h_{CLS} \in \mathbb{R}^{768}$ from BERTimbau-Base
+is passed through a two-layer MLP:
+$$z = \text{ReLU}(W_1 h_{CLS} + b_1), \quad W_1 \in \mathbb{R}^{128 \times 768}$$
+$$\hat{y} = \sigma(W_2 z + b_2), \quad W_2 \in \mathbb{R}^{1 \times 128}$$
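The two equations above map onto two linear layers. Below is a minimal PyTorch sketch of the head alone, assuming the 768-dimensional pooled vector has already been produced by the backbone. The class name `MLPHead` and the attribute names `fc1`/`fc2` are illustrative and not taken from the source repository, which ships its own `BertMLPClassifier`.

```python
import torch
from torch import nn


class MLPHead(nn.Module):
    """Illustrative two-layer head, 768 -> 128 -> 1 (hypothetical name)."""

    def __init__(self, hidden_size: int = 768, mlp_size: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, mlp_size)  # W_1 (128 x 768), b_1
        self.fc2 = nn.Linear(mlp_size, 1)            # W_2 (1 x 128), b_2

    def forward(self, h_cls: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.fc1(h_cls))                 # z = ReLU(W_1 h_CLS + b_1)
        return torch.sigmoid(self.fc2(z)).squeeze(-1)   # y_hat = sigma(W_2 z + b_2)


head = MLPHead()
h_cls = torch.randn(4, 768)  # random stand-in for a batch of pooled [CLS] vectors
with torch.no_grad():
    probs = head(h_cls)      # shape (4,), one harmful-probability per input
```

With randomly initialized weights the probabilities are meaningless; the sketch only shows the shapes and the order of operations.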
+**Training:**
+| Setting | Value |
+|---|---|
+| Base model | neuralmind/bert-base-portuguese-cased |
+| Optimizer | AdamW |
+| Learning rate | 2e-5 |
+| Batch size | 20 |
+| Max sequence length | 512 |
+| LR schedule | Linear warmup (10%) + linear decay |
+| Early stopping patience | 20 (on validation loss) |
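The LR schedule row can be read as a multiplier applied to the 2e-5 peak rate. The sketch below shows that shape, assuming warmup starts at zero and decay ends at zero (the table states neither endpoint). The function name `lr_at_step` and the value of `total_steps` are illustrative only.

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-5,
               warmup_frac: float = 0.10) -> float:
    """Linear warmup to peak_lr, then linear decay to zero (assumed endpoints)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Warmup phase: ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # illustrative; the actual number of training steps is not stated
schedule = [lr_at_step(s, total_steps) for s in (0, 50, 100, 550, 1000)]
```

This is the same shape that `get_linear_schedule_with_warmup` in `transformers` produces; whether the training code uses that helper is an assumption.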
+---
+## Evaluation
+The model was evaluated on a held-out test set (25% of the
+[wildjailbreak-pt-br](https://huggingface.co/datasets/Edu-p/wildjailbreak-pt-br)
+dataset). Metrics are reported at both the standard threshold (τ = 0.5) and the
+KS-optimal threshold (τ*), the score cutoff at which the gap between the
+cumulative score distributions of the benign and harmful classes is largest.
+| Threshold | Accuracy | Precision | Recall | F1 | FPR |
+|---|---|---|---|---|---|
+| τ = 0.5 | 95.4% | 94.9% | 96.1% | 95.5% | 5.4% |
+| τ* = 0.72 | 95.6% | 96.5% | 94.8% | 95.6% | 3.6% |
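To make the columns concrete, here is a small sketch of how each one follows from predicted scores and a threshold. It assumes a prompt is flagged as harmful when its score is greater than or equal to τ (the comparison direction at the boundary is an assumption), and the scores and labels are made-up toy values, not model outputs.

```python
def metrics_at_threshold(scores, labels, tau=0.5):
    """Confusion-matrix metrics, predicting harmful (1) when score >= tau."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        pred = 1 if score >= tau else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


# Toy data: four harmful prompts (label 1) followed by four benign ones (label 0).
scores = [0.95, 0.80, 0.60, 0.30, 0.65, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

at_default = metrics_at_threshold(scores, labels, tau=0.5)   # precision 0.75, recall 0.75, fpr 0.25
at_stricter = metrics_at_threshold(scores, labels, tau=0.72)  # precision 1.0, recall 0.5, fpr 0.0
```

Raising the threshold trades recall for precision and a lower false-positive rate, which is the same direction of change the table shows between τ = 0.5 and τ* = 0.72.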
+**Separability (threshold-independent):**
+| ROC AUC | KS statistic |
+|---|---|
+| 99.2% | 91.2% |
+The KS statistic measures the maximum separation between the cumulative
+score distributions of the benign and harmful classes. A value of 91.2%
+indicates that the model assigns well-separated probability scores to each
+class, so performance is not sensitive to the exact threshold: moving
+from τ = 0.5 to τ* = 0.72 changes F1 by only 0.1 percentage points.
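For a classifier that scores harmful prompts higher, the KS statistic equals the largest gap between true-positive rate and false-positive rate over all thresholds. The reported numbers are consistent with this: at τ* = 0.72, recall minus FPR is 94.8% − 3.6% = 91.2%. The sketch below finds the statistic and a threshold attaining it on made-up toy scores; the function name `ks_statistic` and the `score >= tau` convention are assumptions, not the authors' evaluation code.

```python
def ks_statistic(scores, labels):
    """Return (ks, tau_star): the largest TPR - FPR gap and a threshold attaining it."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_gap, best_tau = 0.0, None
    # Every distinct score is a candidate threshold.
    for tau in sorted(set(scores)):
        tpr = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= tau) / n_pos
        fpr = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= tau) / n_neg
        if tpr - fpr > best_gap:
            best_gap, best_tau = tpr - fpr, tau
    return best_gap, best_tau


# Toy data: four harmful prompts (label 1) followed by four benign ones (label 0).
scores = [0.95, 0.80, 0.60, 0.30, 0.65, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

ks, tau_star = ks_statistic(scores, labels)  # ks = 0.75 at tau_star = 0.30
```

The loop is quadratic in the number of prompts, which is fine for illustration. For larger test sets, `scipy.stats.ks_2samp` applied to the two classes' scores computes the same statistic.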
+---
+## Usage
+```python
+import torch
+from transformers import BertTokenizer, BertModel
+# NOTE: the calls below load only the tokenizer and the BERT backbone.
+# For end-to-end inference, instantiate the full BertMLPClassifier
+# from the source repo: https://github.com/Edu-p/secbert-pt
+tokenizer = BertTokenizer.from_pretrained("Edu-p/secbert-pt")