modernbert-jailbreak-guard
This repository hosts a sequence classification model fine-tuned to act as a pre-inference security gate for Large Language Model applications. Its core task is to inspect incoming user prompts and classify them as benign or jailbreak with minimal latency.
The model uses answerdotai/ModernBERT-base as its encoder backbone with a sequence classification head on top. All parameters were fine-tuned (full-parameter tuning, no frozen layers) on a single T4 GPU.
Test Set Performance
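Evaluated on the 262-example evaluation split, the guardrail achieves the metrics below. These figures match the epoch 1 row of the training results table; by epoch 4 accuracy reaches 99.62% with no false positives.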
- Overall Model Accuracy: 98.47%
- Precision (Jailbreak): 97.87%
- Recall (Jailbreak): 99.28% (Successfully intercepted 138 out of 139 malicious inputs)
- F1-Score (Jailbreak): 98.57%
- Macro F1-Score: 98.47%
- False Negative Rate: 0.72% (1 of 139 jailbreak prompts missed)
- False Positive Rate: 2.44% (3 of 123 benign prompts blocked)
Confusion Matrix Breakdown
- True Negatives (Safe prompts allowed seamlessly): 120
- True Positives (Malicious attacks neutralized): 138
- False Positives (Safe prompts accidentally blocked): 3
- False Negatives (Malicious payloads leaked): 1
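The reported rates follow directly from these four counts. As a minimal sketch (the function name `confusion_metrics` is illustrative, not part of the repository), the arithmetic can be reproduced like this:

```python
def confusion_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive the card's headline metrics from raw confusion-matrix counts."""
    # Jailbreak is the positive class.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # F1 for the benign class, treating benign as the positive class.
    benign_precision = tn / (tn + fn)
    benign_recall = tn / (tn + fp)
    benign_f1 = 2 * benign_precision * benign_recall / (benign_precision + benign_recall)

    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision_jailbreak": precision,
        "recall_jailbreak": recall,
        "f1_jailbreak": f1,
        "macro_f1": (f1 + benign_f1) / 2,
        "false_negative_rate": fn / (fn + tp),
        "false_positive_rate": fp / (fp + tn),
    }


# Counts from the confusion matrix breakdown above.
metrics = confusion_metrics(tn=120, fp=3, fn=1, tp=138)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, the values agree with the percentages listed under Test Set Performance.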
Intended Uses and Limitations
Intended Deployment Design
This model is intended to run as a Pre-Inference Gateway Shield. It intercepts raw string requests coming from client user interfaces before they are routed to generative backends like GPT-4, Llama 3, or Claude.
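A rough sketch of such a gateway is shown here. It assumes the checkpoint is loaded through the Transformers `text-classification` pipeline under the repository id jsayyar04/modernbert-jailbreak-guard; the helper names (`load_classifier`, `gate`), the block message, and the set of label strings are illustrative assumptions. In particular, whether the checkpoint emits `jailbreak` or the default `LABEL_1` depends on its config, so both are accepted.

```python
from typing import Callable, Dict

# Labels treated as "block". "jailbreak" follows the card's label mapping;
# "LABEL_1" is the default name Transformers uses when id2label is not set.
# Which one the checkpoint actually emits is an assumption to verify.
JAILBREAK_LABELS = {"jailbreak", "LABEL_1"}

Classifier = Callable[[str], Dict[str, object]]


def load_classifier() -> Classifier:
    """Wrap the checkpoint in a callable (downloads weights on first use)."""
    # Imported lazily so the gating logic below can be exercised without the model.
    from transformers import pipeline

    pipe = pipeline(
        "text-classification",
        model="jsayyar04/modernbert-jailbreak-guard",
    )
    # For a single string the pipeline returns a one-element list of
    # {"label": ..., "score": ...} dicts.
    return lambda prompt: pipe(prompt)[0]


def gate(prompt: str, classify: Classifier, forward: Callable[[str], str]) -> str:
    """Forward the prompt to the backend only if it is not flagged as a jailbreak."""
    result = classify(prompt)
    if result["label"] in JAILBREAK_LABELS:
        return "Request blocked by input guard."
    return forward(prompt)
```

In use, `classify` would come from `load_classifier()` and `forward` would be whatever function calls the generative backend. Keeping both as plain callables makes the gate easy to unit test with stubs.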
Limitations and Strategy
- Input-Only Scope: This gateway model does not monitor outgoing text generated by the core model. It needs to be coupled with an independent post-inference output alignment model to monitor for hallucinations or data leaks.
- Defense in Depth: The model should be one layer of a broader security design that also includes input blocklists and hardened system prompts. It is not a complete defense on its own.
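The layering described in these two points can be sketched as a chain of independent checks around the backend call. Everything here is hypothetical scaffolding: the `Check` type, `run_guarded`, and the message format are illustrative, and the individual checks (blocklist, this classifier, an output monitor) are left as callables supplied by the application.

```python
from typing import Callable, Iterable, Optional

# A check returns a rejection reason, or None if the text passes.
Check = Callable[[str], Optional[str]]


def run_guarded(
    prompt: str,
    input_checks: Iterable[Check],
    backend: Callable[[str], str],
    output_checks: Iterable[Check],
) -> str:
    """Run input checks, call the backend, then run output checks."""
    for check in input_checks:
        reason = check(prompt)
        if reason is not None:
            # Stop at the first failing layer; the backend is never called.
            return f"blocked (input): {reason}"

    response = backend(prompt)

    for check in output_checks:
        reason = check(response)
        if reason is not None:
            return f"blocked (output): {reason}"
    return response
```

The jailbreak classifier would sit in `input_checks` alongside cheaper filters, while a separate output monitor would go in `output_checks`, since this model does not inspect generated text.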
Training and Evaluation Data
The model was fine-tuned on the balanced split of the jackhhao/jailbreak-classification dataset.
- Training Size: 1,044 examples
- Evaluation Size: 262 examples
- Label Mapping: benign (0), jailbreak (1)
The dataset features a balanced distribution of classic adversarial templates, context-switching overrides, hypothetical roleplay scripts, and standard conversational strings.
Training Procedure
The model was fine-tuned with the Hugging Face Transformers Trainer API on a single cloud-hosted NVIDIA Tesla T4 GPU.
Framework and Optimization Settings
- Optimizer: ADAMW_TORCH_FUSED (PyTorch's fused AdamW implementation)
- Precision: Mixed-precision training enabled via bf16=True (bfloat16)
- Sequence Processing: Unpadding enabled, so padding tokens are stripped from each batch before the encoder processes it instead of consuming compute and memory.
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 16
- seed: 42
- optimizer: ADAMW_TORCH_FUSED with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- num_epochs: 4
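As a sketch, these settings map onto `TrainingArguments` roughly as follows. The mapping of `train_batch_size` to `per_device_train_batch_size` assumes a single device with no gradient accumulation, and the `output_dir` value and helper names are illustrative. The original training script is not part of this card, so this is a reconstruction, not the author's code.

```python
import math

# Hyperparameters from the list above, keyed by TrainingArguments field names.
# The betas and epsilon listed above are the AdamW defaults, so they are omitted.
HYPERPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,   # assumes one GPU, no gradient accumulation
    "per_device_eval_batch_size": 16,
    "seed": 42,
    "optim": "adamw_torch_fused",
    "lr_scheduler_type": "linear",
    "num_train_epochs": 4,
    "bf16": True,
}


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


def build_training_args(output_dir: str = "modernbert-jailbreak-guard"):
    """Construct TrainingArguments; requires transformers and suitable hardware."""
    from transformers import TrainingArguments  # lazy import: heavy dependency

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)


# Sanity check against the Step column of the training results table.
per_epoch = steps_per_epoch(1044, HYPERPARAMS["per_device_train_batch_size"])
total = per_epoch * HYPERPARAMS["num_train_epochs"]
print(per_epoch, total)
```

With 1,044 training examples and a batch size of 8, this gives 131 steps per epoch and 524 steps in total, which agrees with the step counts reported in the training results.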
Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision Jailbreak | Recall Jailbreak | F1 Jailbreak | Macro F1 | True Negatives | False Positives | False Negatives | True Positives | False Negative Rate | False Positive Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0587 | 1.0 | 131 | 0.0544 | 0.9847 | 0.9787 | 0.9928 | 0.9857 | 0.9847 | 120 | 3 | 1 | 138 | 0.0072 | 0.0244 |
| 0.0027 | 2.0 | 262 | 0.0306 | 0.9885 | 0.9857 | 0.9928 | 0.9892 | 0.9885 | 121 | 2 | 1 | 138 | 0.0072 | 0.0163 |
| 0.0001 | 3.0 | 393 | 0.0265 | 0.9924 | 0.9928 | 0.9928 | 0.9928 | 0.9923 | 122 | 1 | 1 | 138 | 0.0072 | 0.0081 |
| 0.0000 | 4.0 | 524 | 0.0266 | 0.9962 | 1.0 | 0.9928 | 0.9964 | 0.9962 | 123 | 0 | 1 | 138 | 0.0072 | 0.0 |
Framework versions
- Transformers 5.10.0.dev0
- Pytorch 2.11.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2
Model tree for jsayyar04/modernbert-jailbreak-guard
- Base model: answerdotai/ModernBERT-base
Evaluation results
- Accuracy on jailbreak-classification (self-reported): 0.996