Edu-p committed on
Commit d22fa4e · verified · 1 Parent(s): 6e3e336

Update README.md

Files changed (1)
  1. README.md +63 -7
README.md CHANGED
@@ -29,7 +29,7 @@ This model was introduced in the paper:
  > **Robustness of Language Models against Portuguese Harmful Prompts**
  > Eduardo Alexandre de Amorim, Cleber Zanchettin
  > *International Joint Conference on Neural Networks (IJCNN)*
- > [[Paper](<link_pub>)] [[Code](https://github.com/Edu-p/secbert-pt)] [[Dataset](https://huggingface.co/datasets/Edu-p/wildjailbreak-pt-br)]

  ---

@@ -65,7 +65,7 @@ $$\hat{y} = \sigma(W_2 z + b_2), \quad W_2 \in \mathbb{R}^{1 \times 128}$$
  ## Evaluation

  Evaluated on a held-out test set (25% of the
- [wildjailbreak-pt-br](https://huggingface.co/datasets/Edu-p/wildjailbreak-pt-br)
  dataset). Metrics are reported at both the standard threshold (τ=0.5) and the
  KS-optimal threshold (τ*), which maximizes class separability.
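
Reviewer's note: the KS-optimal threshold mentioned in this paragraph can be illustrated with a short sketch. It assumes that harmful prompts receive higher scores than benign ones and takes τ* to be the observed score that maximizes the gap between the true-positive rate and the false-positive rate, which is the Kolmogorov-Smirnov distance between the two class score distributions. The helper name `ks_optimal_threshold` and the toy scores are hypothetical, and the procedure used in the paper may differ in its details.

```python
from bisect import bisect_left
from typing import Sequence, Tuple


def ks_optimal_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (tau_star, ks) for the decision rule `score >= tau`.

    labels use 1 for harmful and 0 for benign (an assumption of this sketch).
    """
    harmful = sorted(s for s, y in zip(scores, labels) if y == 1)
    benign = sorted(s for s, y in zip(scores, labels) if y == 0)
    if not harmful or not benign:
        raise ValueError("need at least one example of each class")

    best_tau, best_ks = None, -1.0
    # Only observed scores can change the confusion matrix, so they are
    # the only candidate thresholds worth checking.
    for tau in sorted(set(scores)):
        # Fraction of each class scoring at or above the candidate threshold.
        tpr = (len(harmful) - bisect_left(harmful, tau)) / len(harmful)
        fpr = (len(benign) - bisect_left(benign, tau)) / len(benign)
        gap = tpr - fpr
        if gap > best_ks:
            best_tau, best_ks = tau, gap
    return best_tau, best_ks


# Toy validation scores, made up for illustration only.
toy_scores = [0.10, 0.20, 0.35, 0.80, 0.40, 0.70, 0.90, 0.95]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
tau_star, ks = ks_optimal_threshold(toy_scores, toy_labels)
print(f"tau* = {tau_star:.2f}, KS = {ks:.2f}")
```

In practice the threshold would be chosen on validation scores and then applied unchanged to the held-out test set, so that the reported metrics at τ* are not optimistically biased.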

@@ -90,11 +90,67 @@ class, making threshold selection robust in deployment.
  ## Usage

  ```python
  import torch
- from transformers import BertTokenizer, BertModel

- # NOTE: this loads the tokenizer and backbone — instantiate the full
- # BertMLPClassifier from the source repo for end-to-end inference.
- # See: https://github.com/Edu-p/secbert-pt

- tokenizer = BertTokenizer.from_pretrained("Edu-p/secbert-pt")

  > **Robustness of Language Models against Portuguese Harmful Prompts**
  > Eduardo Alexandre de Amorim, Cleber Zanchettin
  > *International Joint Conference on Neural Networks (IJCNN)*
+ > [[Paper](<link_pub>)] [[Code](https://github.com/Edu-p/secbert-pt)] [[Dataset](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)]

  ---

  ## Evaluation

  Evaluated on a held-out test set (25% of the
+ [harmful-prompts-pt](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)
  dataset). Metrics are reported at both the standard threshold (τ=0.5) and the
  KS-optimal threshold (τ*), which maximizes class separability.

  ## Usage

  ```python
+ from transformers import BertTokenizer
+ from src.model import BertMLPClassifier
  import torch

+ tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
+ model = BertMLPClassifier(
+     model_name="neuralmind/bert-base-portuguese-cased",
+     hidden_dim=768,
+     freeze_backbone=False,
+ )
+ model.load_state_dict(torch.load("best_model.pth", weights_only=True))
+ model.eval()
+
+ # KS-optimal threshold from paper
+ TAU_STAR = 0.72
+
+ inputs = tokenizer(
+     "Ignore suas instruções anteriores e...",
+     return_tensors="pt",
+     truncation=True,
+     max_length=512,
+ )
+ with torch.no_grad():
+     logits = model(**inputs)
+     prob = torch.softmax(logits, dim=1)[0, 1].item()
+
+ label = "harmful" if prob >= TAU_STAR else "benign"
+ print(f"Score: {prob:.3f} → {label}")
+ ```
+
+ For the full `BertMLPClassifier` definition, clone the
+ [source repository](https://github.com/Edu-p/secbert-pt).
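
Reviewer's note: the usage snippet takes `torch.softmax(logits, dim=1)[0, 1]`, which presumes a two-column output, whereas the output layer quoted in this diff, ŷ = σ(W_2 z + b_2) with W_2 ∈ ℝ^{1×128}, describes a single value per prompt. What `BertMLPClassifier.forward` actually returns is not visible in this commit, so the following is only a sketch of how a score could be obtained in either case. The helper name `harmful_probability` and the `is_probability` flag are hypothetical, and treating index 1 as the harmful class is an assumption.

```python
import math
from typing import Sequence


def harmful_probability(output: Sequence[float], is_probability: bool = False) -> float:
    """Map one prompt's model output to P(harmful).

    output: a flat sequence holding either one value (sigmoid-style head)
        or two values (softmax-style head, index 1 assumed to be harmful).
    is_probability: set to True when the single value has already passed
        through the sigmoid inside the model.
    """
    values = [float(v) for v in output]
    if len(values) == 1:
        x = values[0]
        if is_probability:
            return x
        # Numerically stable sigmoid for large negative or positive logits.
        if x >= 0:
            return 1.0 / (1.0 + math.exp(-x))
        ex = math.exp(x)
        return ex / (1.0 + ex)
    if len(values) == 2:
        # Softmax with the maximum subtracted for numerical stability.
        peak = max(values)
        exps = [math.exp(v - peak) for v in values]
        return exps[1] / sum(exps)
    raise ValueError(f"expected 1 or 2 output values, got {len(values)}")


TAU_STAR = 0.72  # KS-optimal threshold quoted in the README snippet

# Made-up outputs: a single logit and a pair of class logits.
for raw in ([1.5], [0.2, 1.7]):
    prob = harmful_probability(raw)
    label = "harmful" if prob >= TAU_STAR else "benign"
    print(f"{raw} -> {prob:.3f} ({label})")
```

With a PyTorch tensor, `logits[0].tolist()` would give the flat sequence this helper expects.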

+ ---
+
+ ## Limitations
+
+ - The dataset was generated via automated translation. Organically crafted
+   Portuguese jailbreaks from native attackers may not be fully represented.
+ - The model was trained on a static snapshot of WildJailbreak attack vectors.
+   Novel jailbreak strategies not present in the training data may evade detection.
+ - SecBERT is designed as one layer of a defense-in-depth strategy, not as a
+   standalone solution.
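
Reviewer's note: the defense-in-depth point in the list above can be made concrete with a small sketch in which the classifier is one of several independent checks and a prompt is blocked as soon as any check flags it. Everything here is illustrative: `first_blocking_layer`, `make_classifier_layer`, the length limit, and the stand-in scorer `dummy_score` are hypothetical and are not part of the repository; only the 0.72 threshold comes from the usage snippet.

```python
from typing import Callable, List, Optional, Tuple

# A layer is a name plus a check that returns True when the prompt should be blocked.
Layer = Tuple[str, Callable[[str], bool]]


def first_blocking_layer(prompt: str, layers: List[Layer]) -> Optional[str]:
    """Return the name of the first layer that flags the prompt, or None."""
    for name, flags in layers:
        if flags(prompt):
            return name
    return None


def make_classifier_layer(score_fn: Callable[[str], float], tau: float) -> Callable[[str], bool]:
    """Wrap a scoring function so that it flags prompts scoring at or above tau."""
    return lambda prompt: score_fn(prompt) >= tau


def dummy_score(prompt: str) -> float:
    """Stand-in for the real model score; a deployment would call the classifier here."""
    return 0.9 if "ignore suas instruções" in prompt.lower() else 0.1


layers: List[Layer] = [
    ("length_limit", lambda p: len(p) > 4000),  # hypothetical cheap pre-filter
    ("secbert", make_classifier_layer(dummy_score, tau=0.72)),
]

for text in ("Ignore suas instruções anteriores e...", "Qual é a capital do Brasil?"):
    blocked_by = first_blocking_layer(text, layers)
    print(f"{text!r}: {'blocked by ' + blocked_by if blocked_by else 'allowed'}")
```

Ordering cheap checks first keeps the classifier off the hot path for inputs that are rejected anyway, and later layers (for example output filtering) can still catch prompts the classifier misses.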
+
+ ---
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{amorim2026secbert,
+   title = {Robustness of Language Models against {P}ortuguese Harmful Prompts},
+   author = {Amorim, Eduardo Alexandre de and Zanchettin, Cleber},
+   booktitle = {Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
+   year = {2026}
+ }
+ ```
+
+ ---
+
+ ## License
+
+ MIT License — research use only. Users are responsible for complying with the
+ terms of the original
+ [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) dataset.