---
language:
- pt
license: mit
base_model: neuralmind/bert-base-portuguese-cased
tags:
- text-classification
- jailbreak-detection
- llm-safety
- red-teaming
- adversarial-attacks
- bert
- portuguese
pipeline_tag: text-classification
metrics:
- f1
- roc_auc
---

# SecBERT-PT

**SecBERT** is a binary classifier for detecting harmful and jailbreak prompts
in Brazilian Portuguese. It is built on top of
[BERTimbau Base](https://huggingface.co/neuralmind/bert-base-portuguese-cased)
with a fully fine-tuned backbone and a two-layer MLP classification head.

This model was introduced in the paper:

> **Robustness of Language Models against Portuguese Harmful Prompts**
> Eduardo Alexandre de Amorim, Cleber Zanchettin
> *International Joint Conference on Neural Networks (IJCNN)*
> [[Paper](<link_pub>)] [[Code](https://github.com/Edu-p/secbert-pt)] [[Dataset](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)]

---

## Model Description

SecBERT frames harmful prompt detection as a binary classification task.
Given an input prompt $x$, the model predicts $P(y=1 \mid x)$, where
$y=1$ indicates a policy-violating (harmful) prompt and $y=0$ indicates
a benign one.

**Architecture:**

The [CLS] pooler output $h_{CLS} \in \mathbb{R}^{768}$ from BERTimbau Base
is passed through a two-layer MLP:

$$z = \text{ReLU}(W_1 h_{CLS} + b_1), \quad W_1 \in \mathbb{R}^{128 \times 768}$$
$$\hat{y} = \sigma(W_2 z + b_2), \quad W_2 \in \mathbb{R}^{1 \times 128}$$

Here $\sigma$ is the logistic sigmoid, so $\hat{y}$ is the model's estimate of
$P(y=1 \mid x)$.
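
The two equations above can be sketched as a small PyTorch module. This is only an illustration of the stated shapes (768 → 128 → 1), not the repository's `BertMLPClassifier`; the class name `MLPHead` and its argument names are hypothetical, and a random tensor stands in for the BERTimbau pooler output.

```python
import torch
from torch import nn


class MLPHead(nn.Module):
    """Hypothetical two-layer head (768 -> 128 -> 1) following the equations above."""

    def __init__(self, in_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)  # W_1 and b_1
        self.fc2 = nn.Linear(hidden_dim, 1)       # W_2 and b_2

    def forward(self, h_cls: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.fc1(h_cls))                 # z = ReLU(W_1 h_CLS + b_1)
        return torch.sigmoid(self.fc2(z)).squeeze(-1)   # y_hat = sigma(W_2 z + b_2)


torch.manual_seed(0)
head = MLPHead()
h_cls = torch.randn(4, 768)  # stand-in for a batch of four pooler outputs
y_hat = head(h_cls)          # one probability per prompt, shape (4,)
n_params = sum(p.numel() for p in head.parameters())
```

Under these shapes the head adds 768·128 + 128 + 128 + 1 = 98,561 trainable parameters on top of the backbone.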

**Training:**

| Setting | Value |
|---|---|
| Base model | neuralmind/bert-base-portuguese-cased |
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Batch size | 20 |
| Max sequence length | 512 |
| LR schedule | Linear warmup (10%) + linear decay |
| Early stopping patience | 20 (on validation loss) |
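
To make the LR schedule row concrete, here is a minimal sketch of a linear warmup followed by linear decay, using the peak rate (2e-5) and warmup fraction (10%) from the table. The function name `lr_at_step`, the total of 1000 steps, and the decay-to-zero endpoint are assumptions for illustration; the card does not state the total number of steps or the final learning rate.

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-5,
               warmup_frac: float = 0.10) -> float:
    """Learning rate with linear warmup to peak_lr, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: ramp linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total = 1000  # hypothetical number of optimizer steps
schedule = [lr_at_step(s, total) for s in range(total + 1)]
```

With these assumed values the rate starts at 0, reaches 2e-5 at step 100, and returns to 0 at step 1000. The same shape is what `transformers.get_linear_schedule_with_warmup` produces, though the card does not say which implementation was used.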

---

## Evaluation

The model was evaluated on a held-out test set (25% of the
[harmful-prompts-pt](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)
dataset). Metrics are reported at both the standard threshold (τ = 0.5) and the
KS-optimal threshold (τ*), the score cutoff at which the gap between the true
positive rate (recall) and the false positive rate (FPR) is largest.

| Threshold | Accuracy | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| τ = 0.5 | 95.4% | 94.9% | 96.1% | 95.5% | 5.4% |
| τ* = 0.72 | 95.6% | 96.5% | 94.8% | 95.6% | 3.6% |
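
As a sketch of how the columns in this table are derived from model scores, the function below computes them for a given threshold. The name `metrics_at_threshold` and the eight toy scores are illustrative assumptions, not data from the paper; a prompt is flagged as harmful when its score is at least τ.

```python
def metrics_at_threshold(scores, labels, tau):
    """Confusion-matrix metrics when scores >= tau are flagged as harmful (label 1)."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        flagged = score >= tau
        if flagged and label == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


# Toy scores, not taken from the paper: four harmful (1) and four benign (0) prompts.
scores = [0.95, 0.80, 0.60, 0.30, 0.75, 0.40, 0.20, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
at_default = metrics_at_threshold(scores, labels, tau=0.5)
at_strict = metrics_at_threshold(scores, labels, tau=0.72)
```

On this toy data, raising the threshold from 0.5 to 0.72 lowers recall from 0.75 to 0.50 while the FPR stays at 0.25, which illustrates why the choice of τ is reported alongside the metrics.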

**Separability (threshold-independent):**

| ROC AUC | KS Statistic |
|---|---|
| 99.2% | 91.2% |

The KS (Kolmogorov-Smirnov) statistic measures the maximum separation between
the cumulative score distributions of the benign and harmful classes. The
reported 91.2% equals the gap between recall and FPR at τ* (94.8% - 3.6%) and
indicates that the model assigns well-separated probability scores to each
class. As a result, performance is not very sensitive to the exact threshold:
moving from τ = 0.5 to τ* = 0.72 changes F1 by only 0.1 percentage points,
trading 1.3 points of recall for a 1.8-point lower FPR.
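
A KS-optimal threshold can be found by scanning candidate cutoffs and keeping the one with the largest TPR - FPR gap. The sketch below shows that search on toy data; the function name `ks_optimal_threshold` and the scores are assumptions for illustration and do not reproduce the paper's evaluation code.

```python
def ks_optimal_threshold(scores, labels):
    """Return (ks, tau_star): the largest TPR - FPR gap and the first cutoff reaching it."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_ks, best_tau = 0.0, 1.0
    # Every distinct score is a candidate cutoff; scores >= tau are flagged as harmful.
    for tau in sorted(set(scores), reverse=True):
        tpr = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= tau) / n_pos
        fpr = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= tau) / n_neg
        if tpr - fpr > best_ks:
            best_ks, best_tau = tpr - fpr, tau
    return best_ks, best_tau


# Toy scores, not taken from the paper: four harmful (1) and four benign (0) prompts.
scores = [0.95, 0.90, 0.80, 0.75, 0.85, 0.30, 0.20, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
ks, tau_star = ks_optimal_threshold(scores, labels)
```

Here the search selects a cutoff of 0.75, where all four harmful prompts and one benign prompt are flagged, giving a KS value of 1.00 - 0.25 = 0.75.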

---

## Usage

```python
import torch
from transformers import BertTokenizer

# BertMLPClassifier comes from the source repository's `src` package, not from transformers.
from src.model import BertMLPClassifier

tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = BertMLPClassifier(
    model_name="neuralmind/bert-base-portuguese-cased",
    hidden_dim=768,
    freeze_backbone=False,
)
# Load the fine-tuned weights on CPU; weights_only=True avoids unpickling arbitrary objects.
model.load_state_dict(torch.load("best_model.pth", map_location="cpu", weights_only=True))
model.eval()

# KS-optimal threshold from the paper
TAU_STAR = 0.72

inputs = tokenizer(
    "Ignore suas instruções anteriores e...",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)
with torch.no_grad():
    logits = model(**inputs)

# Convert the output to P(harmful). A two-logit output uses softmax; otherwise this
# assumes forward() returns a single raw logit, as in the one-unit head described
# under Architecture. Check the repository's forward() before relying on either branch.
if logits.dim() == 2 and logits.size(1) == 2:
    prob = torch.softmax(logits, dim=1)[0, 1].item()
else:
    prob = torch.sigmoid(logits).flatten()[0].item()

label = "harmful" if prob >= TAU_STAR else "benign"
print(f"Score: {prob:.3f} → {label}")
```

For the full `BertMLPClassifier` definition, clone the
[source repository](https://github.com/Edu-p/secbert-pt).

---

## Limitations

- The dataset was generated via automated translation. Organically crafted
  Portuguese jailbreaks from native attackers may not be fully represented.
- The model was trained on a static snapshot of WildJailbreak attack vectors.
  Novel jailbreak strategies not present in the training data may evade detection.
- SecBERT is designed as one layer of a defense-in-depth strategy, not as a
  standalone solution.

---

## Citation

```bibtex
@inproceedings{amorim2026secbert,
  title     = {Robustness of Language Models against {P}ortuguese Harmful Prompts},
  author    = {Amorim, Eduardo Alexandre de and Zanchettin, Cleber},
  booktitle = {Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
  year      = {2026}
}
```

---

## License

MIT License; intended for research use only. Users are responsible for
complying with the terms of the original
[WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) dataset.