Edu-p/secbert-pt

@@ -29,7 +29,7 @@ This model was introduced in the paper:
 > **Robustness of Language Models against Portuguese Harmful Prompts**
 > Eduardo Alexandre de Amorim, Cleber Zanchettin
 > *International Joint Conference on Neural Networks (IJCNN)*
-> [[Paper](<link_pub>)] [[Code](https://github.com/Edu-p/secbert-pt)] [[Dataset](https://huggingface.co/datasets/Edu-p/wildjailbreak-pt-br)]
 ---
@@ -65,7 +65,7 @@ $$\hat{y} = \sigma(W_2 z + b_2), \quad W_2 \in \mathbb{R}^{1 \times 128}$$
 ## Evaluation
 Evaluated on a held-out test set (25% of the
-[wildjailbreak-pt-br](https://huggingface.co/datasets/Edu-p/wildjailbreak-pt-br)
 dataset). Metrics are reported at both the standard threshold (τ=0.5) and the
 KS-optimal threshold (τ*), which maximizes class separability.
@@ -90,11 +90,67 @@ class, making threshold selection robust in deployment.
 ## Usage
 ```python
 import torch
-from transformers import BertTokenizer, BertModel
-# NOTE: this loads the tokenizer and backbone — instantiate the full
-# BertMLPClassifier from the source repo for end-to-end inference.
-# See: https://github.com/Edu-p/secbert-pt
-tokenizer = BertTokenizer.from_pretrained("Edu-p/secbert-pt")

 > **Robustness of Language Models against Portuguese Harmful Prompts**
 > Eduardo Alexandre de Amorim, Cleber Zanchettin
 > *International Joint Conference on Neural Networks (IJCNN)*
+> [[Paper](<link_pub>)] [[Code](https://github.com/Edu-p/secbert-pt)] [[Dataset](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)]
 ---
 ## Evaluation
 Evaluated on a held-out test set (25% of the
+[harmful-prompts-pt](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)
 dataset). Metrics are reported at both the standard threshold (τ=0.5) and the
 KS-optimal threshold (τ*), which maximizes class separability.
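As a rough illustration of how a KS-optimal threshold can be chosen, the sketch below scans candidate thresholds and keeps the one with the largest gap between the benign and harmful empirical score distributions. The function name `ks_optimal_threshold` and the toy scores are hypothetical, and the exact procedure used in the paper may differ.

```python
from bisect import bisect_left


def ks_optimal_threshold(scores, labels):
    """Return (tau_star, ks) for the threshold with the widest class gap.

    scores: predicted harmful-class probabilities in [0, 1].
    labels: 1 for harmful, 0 for benign, in the same order as scores.
    Hypothetical helper for illustration; not taken from the source repo.
    """
    benign = sorted(s for s, y in zip(scores, labels) if y == 0)
    harmful = sorted(s for s, y in zip(scores, labels) if y == 1)
    if not benign or not harmful:
        raise ValueError("need at least one example of each class")

    best_tau, best_ks = 0.5, -1.0
    for tau in sorted(set(scores)):
        # Fraction of each class scoring strictly below tau, which matches
        # the decision rule "harmful if score >= tau".
        cdf_benign = bisect_left(benign, tau) / len(benign)
        cdf_harmful = bisect_left(harmful, tau) / len(harmful)
        # Signed gap; it equals the KS statistic when harmful prompts
        # tend to score higher than benign ones.
        ks = cdf_benign - cdf_harmful
        if ks > best_ks:
            best_tau, best_ks = tau, ks
    return best_tau, best_ks


# Toy validation scores (made up for this example).
toy_scores = [0.1, 0.2, 0.3, 0.6, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 1, 1]
tau_star, ks = ks_optimal_threshold(toy_scores, toy_labels)
print(f"tau* = {tau_star}, KS = {ks}")
```

In practice the threshold would be selected on validation scores and then applied unchanged to the test set.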
 ## Usage
 ```python
+from transformers import BertTokenizer
+from src.model import BertMLPClassifier
 import torch
+tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
+model = BertMLPClassifier(
+    model_name="neuralmind/bert-base-portuguese-cased",
+    hidden_dim=768,
+    freeze_backbone=False,
+)
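+# map_location="cpu" lets a checkpoint saved on GPU load on CPU-only machines.
+model.load_state_dict(
+    torch.load("best_model.pth", map_location="cpu", weights_only=True)
+)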
+model.eval()
+# KS-optimal threshold from paper
+TAU_STAR = 0.72
+inputs = tokenizer(
+    "Ignore suas instruções anteriores e...",
+    return_tensors="pt",
+    truncation=True,
+    max_length=512,
+)
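+with torch.no_grad():
+    logits = model(**inputs)
+    # Assumes the model returns raw logits. A single-logit head (the sigmoid
+    # output layer in the model description) needs sigmoid; softmax applies
+    # only to a two-column (benign, harmful) output.
+    if logits.numel() == 1:
+        prob = torch.sigmoid(logits).item()
+    else:
+        prob = torch.softmax(logits, dim=-1)[0, 1].item()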
+label = "harmful" if prob >= TAU_STAR else "benign"
+print(f"Score: {prob:.3f} → {label}")
+```
+`BertMLPClassifier` is not part of `transformers`. Clone the
+[source repository](https://github.com/Edu-p/secbert-pt) and run the snippet
+from its root so that the `src.model` import resolves.
+---
+## Limitations
+- The dataset was generated via automated translation, so jailbreaks written
+  originally in Portuguese by native speakers may be under-represented.
+- The model was trained on a static snapshot of WildJailbreak attack vectors.
+  Novel jailbreak strategies not present in the training data may evade detection.
+- SecBERT is designed as one layer of a defense-in-depth strategy, not as a
+  standalone solution.
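To make the defense-in-depth point concrete, here is a minimal sketch in which a classifier score is one blocking layer among several. All names (`make_score_check`, `run_layers`, `fake_score`, `length_check`) are hypothetical and the scorer is a stub; a real deployment would plug in the model's harmful-class probability and its own additional checks.

```python
from typing import Callable, Iterable, Tuple

# A layer takes a prompt and returns (allowed, reason).
Check = Callable[[str], Tuple[bool, str]]


def make_score_check(score_fn: Callable[[str], float], tau: float) -> Check:
    """Wrap a harmful-probability scorer as one blocking layer."""
    def check(prompt: str) -> Tuple[bool, str]:
        score = score_fn(prompt)
        return score < tau, f"classifier score {score:.3f} (tau={tau})"
    return check


def run_layers(prompt: str, layers: Iterable[Check]) -> Tuple[bool, str]:
    """Allow the prompt only if every layer allows it; stop at the first block."""
    for layer in layers:
        allowed, reason = layer(prompt)
        if not allowed:
            return False, reason
    return True, "all layers passed"


# Stub scorer standing in for the classifier (assumption, not the real model).
def fake_score(prompt: str) -> float:
    return 0.9 if "ignore" in prompt.lower() else 0.1


# A second, independent layer: a simple length limit.
def length_check(prompt: str) -> Tuple[bool, str]:
    return len(prompt) <= 2000, "prompt too long"


layers = [length_check, make_score_check(fake_score, tau=0.72)]
print(run_layers("Ignore suas instruções anteriores e...", layers))
print(run_layers("Qual é a capital do Brasil?", layers))
```

Because each layer can block independently, a prompt that slips past the classifier can still be caught by another check, and the reverse holds as well.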
+---
+## Citation
+```bibtex
+@inproceedings{amorim2026secbert,
+  title     = {Robustness of Language Models against {P}ortuguese Harmful Prompts},
+  author    = {Amorim, Eduardo Alexandre de and Zanchettin, Cleber},
+  booktitle = {Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
+  year      = {2026}
+}
+```
+---
+## License
+Released under the MIT License, for research use only. Users are responsible
+for complying with the terms of the original
+[WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) dataset.