---
language:
- pt
license: mit
base_model: neuralmind/bert-base-portuguese-cased
tags:
- text-classification
- jailbreak-detection
- llm-safety
- red-teaming
- adversarial-attacks
- bert
- portuguese
pipeline_tag: text-classification
metrics:
- f1
- roc_auc
---

# SecBERT-PT

**SecBERT** is a binary classifier for detecting harmful and jailbreak prompts in Brazilian Portuguese. It is built on top of [BERTimbau Base](https://huggingface.co/neuralmind/bert-base-portuguese-cased) with a fully fine-tuned backbone and a two-layer MLP classification head.

This model was introduced in the paper:

> **Robustness of Language Models against Portuguese Harmful Prompts**
> Eduardo Alexandre de Amorim, Cleber Zanchettin
> *International Joint Conference on Neural Networks (IJCNN)*
> [[Paper]()] [[Code](https://github.com/Edu-p/secbert-pt)] [[Dataset](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)]

---

## Model Description

SecBERT frames harmful prompt detection as a binary classification task. Given an input prompt $x$, the model predicts $P(y=1 \mid x)$, where $y=1$ indicates a policy-violating (harmful) prompt and $y=0$ indicates a benign one.

**Architecture:** The [CLS] pooler output $h_{CLS} \in \mathbb{R}^{768}$ from BERTimbau-Base is passed through a two-layer MLP:

$$z = \text{ReLU}(W_1 h_{CLS} + b_1), \quad W_1 \in \mathbb{R}^{128 \times 768}$$

$$\hat{y} = \sigma(W_2 z + b_2), \quad W_2 \in \mathbb{R}^{1 \times 128}$$

**Training:**

| Setting | Value |
|---|---|
| Base model | neuralmind/bert-base-portuguese-cased |
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Batch size | 20 |
| Max sequence length | 512 |
| LR schedule | Linear warmup (10%) + linear decay |
| Early stopping patience | 20 (on validation loss) |

---

## Evaluation

Evaluated on a held-out test set (25% of the [harmful-prompts-pt](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt) dataset).
Metrics are reported at both the standard threshold (τ = 0.5) and the KS-optimal threshold (τ*), which maximizes class separability.

| Threshold | Accuracy | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| τ = 0.5 | 95.4% | 94.9% | 96.1% | 95.5% | 5.4% |
| τ* = 0.72 | 95.6% | 96.5% | 94.8% | 95.6% | 3.6% |

**Separability (threshold-independent):**

| AUC | KS Statistic |
|---|---|
| 99.2% | 91.2% |

The KS statistic measures the maximum separation between the cumulative score distributions of the benign and harmful classes. A value of 91.2% indicates that the model assigns well-separated probability scores to each class, so the exact threshold matters little in deployment: moving from τ = 0.5 to τ* = 0.72 changes F1 by only 0.1 points while lowering the FPR from 5.4% to 3.6%.

---

## Usage

```python
import torch
from transformers import BertTokenizer

# BertMLPClassifier is defined in the source repository (src/model.py).
from src.model import BertMLPClassifier

tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = BertMLPClassifier(
    model_name="neuralmind/bert-base-portuguese-cased",
    hidden_dim=768,
    freeze_backbone=False,
)
# map_location="cpu" lets the checkpoint load on machines without a GPU.
model.load_state_dict(
    torch.load("best_model.pth", map_location="cpu", weights_only=True)
)
model.eval()

# KS-optimal threshold from paper
TAU_STAR = 0.72

inputs = tokenizer(
    "Ignore suas instruções anteriores e...",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

with torch.no_grad():
    logits = model(**inputs)

# The head has a single output unit (see Architecture), so a sigmoid, not a
# softmax over two classes, maps it to P(y=1 | x).
# Assumption: forward() returns the raw logit. If it already applies the
# sigmoid, use the output directly instead.
prob = torch.sigmoid(logits).squeeze().item()
label = "harmful" if prob >= TAU_STAR else "benign"
print(f"Score: {prob:.3f} → {label}")
```

For the full `BertMLPClassifier` definition, clone the [source repository](https://github.com/Edu-p/secbert-pt).

---

## Limitations

- The dataset was generated via automated translation. Organically crafted Portuguese jailbreaks from native attackers may not be fully represented.
- The model was trained on a static snapshot of WildJailbreak attack vectors. Novel jailbreak strategies not present in the training data may evade detection.
- SecBERT is designed as one layer of a defense-in-depth strategy, not as a standalone solution.

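
For readers who want to re-derive a KS-optimal threshold on their own data, the idea from the Evaluation section can be sketched in plain Python. This is a minimal illustration, not the paper's code: the function name `ks_optimal_threshold` and the toy scores are hypothetical, and it assumes that harmful prompts tend to receive higher scores, so the KS statistic equals the largest gap between the true-positive rate and the false-positive rate over all thresholds.

```python
def ks_optimal_threshold(scores, labels):
    """Return (tau_star, ks), where ks is the largest TPR - FPR gap.

    scores: predicted P(y=1 | x) per prompt.
    labels: 1 = harmful, 0 = benign.
    A prompt is flagged as harmful when score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one example of each class")

    # Sweep candidate thresholds from the highest score to the lowest.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    best_ks, best_tau = -1.0, None
    i = 0
    while i < len(pairs):
        tau = pairs[i][0]
        # Consume every example tied at this score before measuring the gap.
        while i < len(pairs) and pairs[i][0] == tau:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        ks = tp / n_pos - fp / n_neg
        if ks > best_ks:  # on ties, the higher threshold is kept
            best_ks, best_tau = ks, tau
    return best_tau, best_ks


# Toy data (illustrative only): four harmful and four benign prompts.
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.5, 0.1, 0.05]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
tau_star, ks = ks_optimal_threshold(toy_scores, toy_labels)
```

On the toy data this selects a threshold of 0.7 with a KS statistic of 0.75: three of the four harmful prompts score at or above 0.7, and no benign prompt does.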

---

## Citation

```bibtex
@inproceedings{amorim2026secbert,
  title     = {Robustness of Language Models against {P}ortuguese Harmful Prompts},
  author    = {Amorim, Eduardo Alexandre de and Zanchettin, Cleber},
  booktitle = {Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
  year      = {2026}
}
```

---

## License

MIT License: research use only. Users are responsible for complying with the terms of the original [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) dataset.