	---
	language:
	- pt
	license: mit
	base_model: neuralmind/bert-base-portuguese-cased
	tags:
	- text-classification
	- jailbreak-detection
	- llm-safety
	- red-teaming
	- adversarial-attacks
	- bert
	- portuguese
	pipeline_tag: text-classification
	metrics:
	- f1
	- roc_auc
	---

	# SecBERT-PT

	SecBERT is a binary classifier for detecting harmful and jailbreak prompts
	in Brazilian Portuguese. It is built on top of
	[BERTimbau Base](https://huggingface.co/neuralmind/bert-base-portuguese-cased)
	with a fully fine-tuned backbone and a two-layer MLP classification head.

	This model was introduced in the paper:

	> Robustness of Language Models against Portuguese Harmful Prompts
	> Eduardo Alexandre de Amorim, Cleber Zanchettin
	> International Joint Conference on Neural Networks (IJCNN)
	> [[Paper](<link_pub>)] [[Code](https://github.com/Edu-p/secbert-pt)] [[Dataset](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)]

	---

	## Model Description

	SecBERT frames harmful prompt detection as a binary classification task.
	Given an input prompt $x$, the model predicts $P(y=1 \mid x)$, where
	$y=1$ indicates a policy-violating (harmful) prompt and $y=0$ indicates
	a benign one.

	Architecture:

	The [CLS] pooler output $h_{CLS} \in \mathbb{R}^{768}$ from BERTimbau-Base
	is passed through a two-layer MLP:

	$$z = \text{ReLU}(W_1 h_{CLS} + b_1), \quad W_1 \in \mathbb{R}^{128 \times 768}$$
	$$\hat{y} = \sigma(W_2 z + b_2), \quad W_2 \in \mathbb{R}^{1 \times 128}$$
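As a rough illustration of the head described above, the following PyTorch sketch maps a 768-dimensional pooled vector to a single probability. The class name `MLPHead` and its argument names are hypothetical and are not taken from the source repository. The actual `BertMLPClassifier` may differ, for example by returning logits rather than probabilities.

```python
import torch
from torch import nn


class MLPHead(nn.Module):
    """Hypothetical two-layer head (768 -> 128 -> 1) following the equations above."""

    def __init__(self, in_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)  # W_1, b_1
        self.fc2 = nn.Linear(hidden_dim, 1)       # W_2, b_2

    def forward(self, h_cls: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.fc1(h_cls))                 # z = ReLU(W_1 h_CLS + b_1)
        return torch.sigmoid(self.fc2(z)).squeeze(-1)   # y_hat = sigma(W_2 z + b_2)


head = MLPHead()
h_cls = torch.randn(4, 768)  # stand-in for a batch of 4 pooled [CLS] vectors
y_hat = head(h_cls)          # one score per example
```

With a single sigmoid output, each score can be read directly as the estimate of $P(y=1 \mid x)$ and compared against a threshold.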

	Training:

	| Setting | Value |
	|---|---|
	| Base model | neuralmind/bert-base-portuguese-cased |
	| Optimizer | AdamW |
	| Learning rate | 2e-5 |
	| Batch size | 20 |
	| Max sequence length | 512 |
	| LR schedule | Linear warmup (10%) + linear decay |
	| Early stopping patience | 20 (on validation loss) |
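To make the learning-rate schedule in the table concrete, here is a minimal sketch of linear warmup over the first 10% of steps followed by linear decay. The function name `lr_at_step`, the decay-to-zero endpoint, and the step count of 1000 are assumptions for illustration. The source states only the warmup fraction, the peak rate, and that the decay is linear.

```python
def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-5,
               warmup_frac: float = 0.10) -> float:
    """Learning rate under linear warmup, then linear decay to zero (assumed endpoint)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr during warmup.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr at the end of warmup down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total = 1000  # illustrative number of optimizer steps, not from the paper
schedule = [lr_at_step(s, total) for s in (0, 50, 100, 550, 1000)]
```

In a Hugging Face training loop, a schedule of this shape is commonly obtained with `transformers.get_linear_schedule_with_warmup`. Whether the authors used that helper is not stated here.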

	---

	## Evaluation

	The model was evaluated on a held-out test set (25% of the
	[harmful-prompts-pt](https://huggingface.co/datasets/Edu-p/harmful-prompts-pt)
	dataset). Metrics are reported at both the standard threshold (τ = 0.5) and the
	KS-optimal threshold (τ*), the score cutoff at which the KS statistic is attained.

	| Threshold | Accuracy | Precision | Recall | F1 | FPR |
	|---|---|---|---|---|---|
	| τ = 0.5 | 95.4% | 94.9% | 96.1% | 95.5% | 5.4% |
	| τ* = 0.72 | 95.6% | 96.5% | 94.8% | 95.6% | 3.6% |
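The columns of the table above can be recomputed from raw scores at any threshold. The sketch below is a minimal pure-Python version. The function name `metrics_at_threshold` and the toy scores are illustrative assumptions, not the paper's evaluation code or data.

```python
def metrics_at_threshold(scores, labels, tau):
    """Confusion-matrix metrics when score >= tau is predicted harmful (label 1)."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= tau and y == 1)
    fp = sum(1 for s, y in pairs if s >= tau and y == 0)
    fn = sum(1 for s, y in pairs if s < tau and y == 1)
    tn = sum(1 for s, y in pairs if s < tau and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


# Toy data for illustration only: four harmful (1) and four benign (0) prompts.
toy_scores = [0.95, 0.80, 0.65, 0.30, 0.75, 0.40, 0.10, 0.05]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
at_half = metrics_at_threshold(toy_scores, toy_labels, tau=0.5)
```

Raising the threshold trades recall for a lower false positive rate, which is the pattern visible between the two rows of the table.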

	Separability (threshold-independent):

	| AUC | KS Statistic |
	|---|---|
	| 99.2% | 91.2% |

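	The KS statistic measures the maximum separation between the cumulative
	score distributions of the benign and harmful classes, which equals the
	largest gap between recall and FPR over all thresholds. At τ* = 0.72 this
	gap is 94.8% − 3.6% = 91.2%. A value this high indicates that the model
	assigns well-separated scores to the two classes, so the error rates are
	relatively insensitive to the exact threshold chosen in deployment.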
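One way to obtain the KS statistic and its threshold from held-out scores is to scan the observed scores as candidate cutoffs and keep the one with the largest gap between true positive rate and false positive rate. The sketch below assumes that harmful prompts tend to receive higher scores. The function name `ks_statistic` and the toy data are hypothetical, and the quadratic scan is chosen for readability rather than speed.

```python
def ks_statistic(scores, labels):
    """Return (ks, tau_star): the largest TPR - FPR gap and the threshold achieving it."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_gap, best_tau = 0.0, None
    for tau in sorted(set(scores)):
        # Rates when every score >= tau is flagged as harmful.
        tpr = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= tau) / n_pos
        fpr = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= tau) / n_neg
        if tpr - fpr > best_gap:
            best_gap, best_tau = tpr - fpr, tau
    return best_gap, best_tau


# Toy data for illustration only: four harmful (1) and four benign (0) prompts.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
ks, tau_star = ks_statistic(toy_scores, toy_labels)
```

When several thresholds tie for the largest gap, this version keeps the lowest one, which favors recall over a lower false positive rate.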

	---

	## Usage

	```python
	from transformers import BertTokenizer
	from src.model import BertMLPClassifier
	import torch

	tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
	model = BertMLPClassifier(
	    model_name="neuralmind/bert-base-portuguese-cased",
	    hidden_dim=768,
	    freeze_backbone=False,
	)
	model.load_state_dict(torch.load("best_model.pth", weights_only=True))
	model.eval()

	# KS-optimal threshold from paper
	TAU_STAR = 0.72

	inputs = tokenizer(
	    "Ignore suas instruções anteriores e...",
	    return_tensors="pt",
	    truncation=True,
	    max_length=512,
	)
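	with torch.no_grad():
	    logits = model(**inputs)
	    # Assumes two-class logits of shape (1, 2). If the model returns a single
	    # logit, as the architecture section suggests, use torch.sigmoid(logits) instead.
	    prob = torch.softmax(logits, dim=1)[0, 1].item()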

	label = "harmful" if prob >= TAU_STAR else "benign"
	print(f"Score: {prob:.3f} → {label}")
	```

	For the full `BertMLPClassifier` definition, clone the
	[source repository](https://github.com/Edu-p/secbert-pt).

	---

	## Limitations

	- The dataset was generated via automated translation. Organically crafted
	Portuguese jailbreaks from native attackers may not be fully represented.
	- The model was trained on a static snapshot of WildJailbreak attack vectors.
	Novel jailbreak strategies not present in the training data may evade detection.
	- SecBERT is designed as one layer of a defense-in-depth strategy, not as a
	standalone solution.

	---

	## Citation

	```bibtex
	@inproceedings{amorim2026secbert,
	title = {Robustness of Language Models against {P}ortuguese Harmful Prompts},
	author = {Amorim, Eduardo Alexandre de and Zanchettin, Cleber},
	booktitle = {Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
	year = {2026}
	}
	```

	---

	## License

	MIT License — research use only. Users are responsible for complying with the
	terms of the original
	[WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) dataset.