modernbert-jailbreak-guard

This repository hosts a sequence classification model fine-tuned to act as a pre-inference security gate for Large Language Model applications. Its core task is to inspect incoming user prompts and classify them as benign or jailbreak with minimal latency.

The architecture leverages answerdotai/ModernBERT-base as its core encoder backbone and attaches a sequence classification head optimized over full-parameter weight tuning on a T4 GPU.

Test Set Performance

Evaluated against an independent test split, this guardrail achieves strong metrics:

  • Overall Model Accuracy: 98.47%
  • Precision (Jailbreak): 97.87%
  • Recall (Jailbreak): 99.28% (Successfully intercepted 138 out of 139 malicious inputs)
  • F1-Score (Jailbreak): 98.57%
  • Macro F1-Score: 98.47%
  • False Negative Rate: 0.72% (Critical leaks minimized to under 1%)
  • False Positive Rate: 2.44% (Maintains smooth user experience with minimal false blocks)

Confusion Matrix Breakdown

  • True Negatives (Safe prompts allowed seamlessly): 120
  • True Positives (Malicious attacks neutralized): 138
  • False Positives (Safe prompts accidentally blocked): 3
  • False Negatives (Malicious payloads leaked): 1

Intended Uses and Limitations

Intended Deployment Design

This model is intended to run as a Pre-Inference Gateway Shield. It intercepts raw string requests coming from client user interfaces before they are routed to generative backends like GPT-4, Llama 3, or Claude.

Limitations and Strategy

  • Input-Only Scope: This gateway model does not monitor outgoing text generated by the core model. It needs to be coupled with an independent post-inference output alignment model to monitor for hallucinations or data leaks.
  • Defense in Depth: While highly robust, it should represent one tier of a holistic security layout including input vector blacklists and runtime system prompts.

Training and Evaluation Data

The model was fine-tuned on the balanced split of the jackhhao/jailbreak-classification dataset.

  • Training Size: 1,044 examples
  • Evaluation Size: 262 examples
  • Label Mapping: benign (0), jailbreak (1)

The dataset features a balanced distribution of classic adversarial templates, context-switching overrides, hypothetical roleplay scripts, and standard conversational strings.

Training Procedure

The model was fine-tuned using the Hugging Face Trainer library on a single cloud-hosted NVIDIA Tesla T4 GPU.

Framework and Optimization Settings

  • Optimizer: ADAMW_TORCH_FUSED (Accelerated hardware optimization kernel)
  • Precision: Mixed-precision training enabled via BF16=True (Brain Floating Point)
  • Sequence Processing: Token unpadding active to dynamically strip empty PAD tensors from memory allocation blocks.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 8
  • eval_batch_size: 16
  • seed: 42
  • optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 4

Training results

Training Loss Epoch Step Validation Loss Accuracy Precision Jailbreak Recall Jailbreak F1 Jailbreak Macro F1 True Negatives False Positives False Negatives True Positives False Negative Rate False Positive Rate
0.0587 1.0 131 0.0544 0.9847 0.9787 0.9928 0.9857 0.9847 120 3 1 138 0.0072 0.0244
0.0027 2.0 262 0.0306 0.9885 0.9857 0.9928 0.9892 0.9885 121 2 1 138 0.0072 0.0163
0.0001 3.0 393 0.0265 0.9924 0.9928 0.9928 0.9928 0.9923 122 1 1 138 0.0072 0.0081
0.0000 4.0 524 0.0266 0.9962 1.0 0.9928 0.9964 0.9962 123 0 1 138 0.0072 0.0

Framework versions

  • Transformers 5.10.0.dev0
  • Pytorch 2.11.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.2
Downloads last month
-
Safetensors
Model size
0.1B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for jsayyar04/modernbert-jailbreak-guard

Finetuned
(1311)
this model

Dataset used to train jsayyar04/modernbert-jailbreak-guard

Space using jsayyar04/modernbert-jailbreak-guard 1

Evaluation results