modernbert-jailbreak-guard

This repository hosts a sequence classification model fine-tuned to act as a pre-inference security gate for Large Language Model applications. Its core task is to inspect incoming user prompts and classify them as benign or jailbreak with minimal latency.

The model uses answerdotai/ModernBERT-base as its encoder backbone with a sequence classification head on top. All parameters were fine-tuned on a single NVIDIA T4 GPU.

Test Set Performance

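On the 262-example evaluation split, the model achieves the metrics below. These figures match the epoch 1 row of the training results table; later epochs score higher, reaching 99.62% accuracy at epoch 4.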

Overall Model Accuracy: 98.47%
Precision (Jailbreak): 97.87%
Recall (Jailbreak): 99.28% (138 of 139 jailbreak prompts caught)
F1-Score (Jailbreak): 98.57%
Macro F1-Score: 98.47%
False Negative Rate: 0.72% (1 of 139 jailbreak prompts missed)
False Positive Rate: 2.44% (3 of 123 benign prompts blocked)

Confusion Matrix Breakdown

True Negatives (benign prompts allowed): 120
True Positives (jailbreak prompts blocked): 138
False Positives (benign prompts blocked): 3
False Negatives (jailbreak prompts allowed through): 1
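The headline metrics follow directly from these four counts. The sketch below recomputes them; the function name `confusion_metrics` is illustrative and not part of this repository.

```python
def confusion_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive classification metrics from raw confusion-matrix counts.

    "Positive" means the jailbreak class; "negative" means benign.
    """
    f1_jailbreak = 2 * tp / (2 * tp + fp + fn)
    # For the benign class the roles swap: tn are its hits, fn and fp its errors.
    f1_benign = 2 * tn / (2 * tn + fn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1_jailbreak": f1_jailbreak,
        "macro_f1": (f1_jailbreak + f1_benign) / 2,
        "false_negative_rate": fn / (fn + tp),
        "false_positive_rate": fp / (fp + tn),
    }


# Counts taken from the confusion matrix above.
metrics = confusion_metrics(tn=120, fp=3, fn=1, tp=138)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

The function does not guard against empty classes, so it would raise `ZeroDivisionError` on a split that contains no jailbreak or no benign examples.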

Intended Uses and Limitations

Intended Deployment Design

This model is intended to run as a Pre-Inference Gateway Shield. It intercepts raw string requests coming from client user interfaces before they are routed to generative backends like GPT-4, Llama 3, or Claude.
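A minimal sketch of such a gate is shown here, under several assumptions: the classifier is passed in as a callable that returns Transformers pipeline-style output (a list of `{"label", "score"}` dicts), the positive label string is `"jailbreak"` (the actual string depends on the checkpoint's `id2label` config), and `generate` is a hypothetical stand-in for the call to the generative backend. The 0.5 threshold is illustrative, not a tuned value.

```python
from typing import Callable, Dict, List

# Assumed label name; check the checkpoint's id2label mapping before relying on it.
JAILBREAK_LABEL = "jailbreak"

Classifier = Callable[[str], List[Dict]]


def is_allowed(prompt: str, classify: Classifier, threshold: float = 0.5) -> bool:
    """Return True when the prompt may be forwarded to the generative backend."""
    top = classify(prompt)[0]  # pipeline-style output: [{"label": ..., "score": ...}]
    flagged = top["label"].lower() == JAILBREAK_LABEL and top["score"] >= threshold
    return not flagged


def handle_request(prompt: str, classify: Classifier,
                   generate: Callable[[str], str]) -> str:
    """Run the guard first and call the backend only for prompts that pass."""
    if not is_allowed(prompt, classify):
        return "Request blocked by input guard."
    return generate(prompt)
```

With Transformers installed, `classify` could be built as `pipeline("text-classification", model="jsayyar04/modernbert-jailbreak-guard")`. Injecting the classifier keeps the gate logic testable without downloading the model.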

Limitations and Strategy

Input-Only Scope: The model inspects incoming prompts only and does not monitor text generated by the downstream model. Pair it with a separate post-inference output check to catch hallucinations or data leaks.
Defense in Depth: Although it scores well on its evaluation split, the model should be one layer of a broader security design that also includes input blocklists and runtime system prompts.

Training and Evaluation Data

The model was fine-tuned on the balanced split of the jackhhao/jailbreak-classification dataset.

Training Size: 1,044 examples
Evaluation Size: 262 examples
Label Mapping: benign (0), jailbreak (1)
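The label mapping can be expressed as a pair of dictionaries, as in the sketch below. It assumes each dataset row carries its class as a string in a `type` column next to a `prompt` column; those column names are an assumption here and should be checked against the dataset.

```python
# Mapping stated in this card: benign -> 0, jailbreak -> 1.
LABEL2ID = {"benign": 0, "jailbreak": 1}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def encode_label(example: dict) -> dict:
    """Return a copy of the example with an integer `label` field added.

    Assumes the class name is stored under the `type` key (hypothetical).
    """
    return {**example, "label": LABEL2ID[example["type"]]}


row = {"prompt": "What is the capital of France?", "type": "benign"}
print(encode_label(row)["label"])  # 0
```

A function like this can be applied with `Dataset.map`, and the two dictionaries can be passed as `id2label` and `label2id` when loading the model with `AutoModelForSequenceClassification.from_pretrained`, so that predictions come back with readable label names.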

The dataset features a balanced distribution of classic adversarial templates, context-switching overrides, hypothetical roleplay scripts, and standard conversational strings.

Training Procedure

The model was fine-tuned with the Hugging Face Transformers Trainer API on a single cloud-hosted NVIDIA Tesla T4 GPU.

Framework and Optimization Settings

Optimizer: ADAMW_TORCH_FUSED (PyTorch's fused AdamW implementation)
Precision: Mixed-precision training enabled via bf16=True (bfloat16)
Sequence Processing: Unpadding enabled, so padding tokens are stripped from each batch before the encoder processes it.
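To illustrate what unpadding means conceptually, the sketch below flattens a padded batch into a single token list and records where each sequence starts and ends. This is a plain-Python illustration of the idea only, not the implementation used inside ModernBERT; the names `unpad` and `cu_seqlens` are chosen for the example.

```python
def unpad(input_ids, attention_mask):
    """Drop padded positions and concatenate the remaining tokens.

    Returns the flat token list plus cumulative sequence lengths, so that
    sequence i occupies flat_tokens[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    flat_tokens = []
    cu_seqlens = [0]
    for ids, mask in zip(input_ids, attention_mask):
        kept = [tok for tok, m in zip(ids, mask) if m == 1]
        flat_tokens.extend(kept)
        cu_seqlens.append(cu_seqlens[-1] + len(kept))
    return flat_tokens, cu_seqlens


# Two sequences padded to length 4; 0 stands in for the PAD token id.
batch_ids = [[101, 7, 8, 102], [101, 9, 102, 0]]
batch_mask = [[1, 1, 1, 1], [1, 1, 1, 0]]

tokens, offsets = unpad(batch_ids, batch_mask)
print(tokens)   # [101, 7, 8, 102, 101, 9, 102]
print(offsets)  # [0, 4, 7]
```

The saving grows with the spread of prompt lengths in a batch: the more padding a batch would otherwise carry, the more computation is skipped.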

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 8
eval_batch_size: 16
seed: 42
optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
lr_scheduler_type: linear
num_epochs: 4
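As a rough guide to reproducing this setup, the hyperparameters above map onto `TrainingArguments` keyword names as sketched below. The sketch assumes a single device and no gradient accumulation, neither of which is stated explicitly in this card; the `bf16` entry comes from the optimization settings listed earlier. The helper `optimizer_steps` is illustrative and checks the values against the step counts in the training results table.

```python
import math

# Keyword arguments in the naming used by transformers.TrainingArguments.
training_kwargs = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 16,
    "seed": 42,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 4,
    "bf16": True,
}


def optimizer_steps(num_examples: int, batch_size: int, epochs: int):
    """Return (steps per epoch, total steps), assuming one device and
    no gradient accumulation. The last partial batch counts as a step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


per_epoch, total = optimizer_steps(
    num_examples=1044,
    batch_size=training_kwargs["per_device_train_batch_size"],
    epochs=training_kwargs["num_train_epochs"],
)
print(per_epoch, total)  # 131 524
```

The dictionary can be unpacked into `TrainingArguments(**training_kwargs)` together with an output directory of your choice. The computed 131 steps per epoch and 524 total steps agree with the Step column of the training results.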

Training results

Training Loss	Epoch	Step	Validation Loss	Accuracy	Precision Jailbreak	Recall Jailbreak	F1 Jailbreak	Macro F1	True Negatives	False Positives	False Negatives	True Positives	False Negative Rate	False Positive Rate
0.0587	1.0	131	0.0544	0.9847	0.9787	0.9928	0.9857	0.9847	120	3	1	138	0.0072	0.0244
0.0027	2.0	262	0.0306	0.9885	0.9857	0.9928	0.9892	0.9885	121	2	1	138	0.0072	0.0163
0.0001	3.0	393	0.0265	0.9924	0.9928	0.9928	0.9928	0.9923	122	1	1	138	0.0072	0.0081
0.0000	4.0	524	0.0266	0.9962	1.0	0.9928	0.9964	0.9962	123	0	1	138	0.0072	0.0

Framework versions

Transformers 5.10.0.dev0
Pytorch 2.11.0+cu128
Datasets 4.0.0
Tokenizers 0.22.2

Safetensors

Model size: 0.1B params
Tensor type: F32

Evaluation results

Accuracy on jailbreak-classification (self-reported): 0.996