# Jailbreak Detection Model

## Model Description
This model is fine-tuned to detect jailbreak attempts in LLM prompts. It classifies prompts as either BENIGN or JAILBREAK.
- Base Model: microsoft/deberta-v3-small
- Training Dataset: jackhhao/jailbreak-classification
- Training Date: 2025-10-16
## Performance Metrics
- Accuracy: 0.9962
- Precision: 1.0000
- Recall: 0.9928
- F1 Score: 0.9964
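The reported metrics are internally consistent: F1 is the harmonic mean of precision and recall, which can be checked directly.

```python
precision, recall = 1.0000, 0.9928

# F1 = 2PR / (P + R); rounds to the reported 0.9964.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```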
## Training Details
- Learning Rate: 2e-05
- Batch Size: 16
- Epochs: 5
- Max Length: 512
- Class Weighting: True
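The card notes that class weighting was enabled but does not specify the scheme. A common choice is inverse-frequency weighting (as in scikit-learn's "balanced" mode); the sketch below illustrates that scheme only, and may not match the exact weights used in training.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: n_samples / (n_classes * count_c).

    A minimal sketch of the common "balanced" scheme; the exact
    weighting used to train this model is not specified in the card.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# A skewed toy label set: the rare class gets the larger weight.
print(class_weights([0, 0, 0, 1]))
```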
## Usage

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="traromal/AIccel_Jailbreak")
result = classifier("Your prompt here")
print(result)
```
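The pipeline returns a list of prediction dicts like `{"label": "JAILBREAK", "score": 0.97}`. One way to act on a prediction is to gate on a confidence threshold; the 0.9 cutoff below is illustrative, not a calibrated value.

```python
def is_jailbreak(prediction, threshold=0.9):
    """Flag a prompt only when the model is confident.

    `prediction` is a single dict from the pipeline output, e.g.
    {"label": "JAILBREAK", "score": 0.97}. The 0.9 threshold is an
    illustrative choice; tune it against your own traffic.
    """
    return prediction["label"] == "JAILBREAK" and prediction["score"] >= threshold
```

For a single input string, pass `result[0]` from the usage snippet above.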
## Labels
- BENIGN (0): Safe, normal prompts
- JAILBREAK (1): Potential jailbreak attempts
## Label Mapping
- Original dataset labels: "benign" -> 0, "jailbreak" -> 1
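In code, the mapping above amounts to a pair of dictionaries (note the dataset uses lowercase strings while the model reports uppercase labels):

```python
# id <-> label maps matching the card above. The source dataset used
# lowercase "benign"/"jailbreak"; the model emits uppercase labels.
label2id = {"BENIGN": 0, "JAILBREAK": 1}
id2label = {i: lbl for lbl, i in label2id.items()}
print(id2label)
```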
## Limitations
- Model may not detect novel jailbreak techniques
- Performance may degrade on prompts dissimilar to the training distribution
- Should be used as part of a layered security approach
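One way to realize a layered approach is to combine the classifier's verdict with an independent heuristic, so that a miss by either layer can still be caught by the other. The marker list below is purely illustrative; real deployments would add further layers (rate limiting, output filtering, human review).

```python
# Illustrative substring markers only; not an exhaustive or vetted list.
SUSPICIOUS_MARKERS = (
    "ignore previous instructions",
    "developer mode",
    "do anything now",
)

def screen_prompt(prompt, classifier_flagged):
    """Block a prompt if EITHER the model or a keyword heuristic fires.

    `classifier_flagged` is the boolean verdict from the model layer.
    Returns True when the prompt should be blocked or escalated.
    """
    lowered = prompt.lower()
    heuristic_hit = any(marker in lowered for marker in SUSPICIOUS_MARKERS)
    return classifier_flagged or heuristic_hit
```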
## Training Configuration

```json
{
  "learning_rate": 2e-05,
  "batch_size": 16,
  "num_epochs": 5,
  "max_length": 512,
  "weight_decay": 0.01,
  "warmup_ratio": 0.1
}
```
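A sketch of how this configuration maps onto Hugging Face `TrainingArguments`. The `output_dir` value and the tokenization helper are illustrative assumptions, and `max_length` applies at tokenization time rather than in `TrainingArguments`.

```python
from transformers import AutoTokenizer, TrainingArguments

# Hyperparameters taken from the configuration above; output_dir is a
# placeholder chosen for this example.
args = TrainingArguments(
    output_dir="jailbreak-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_ratio=0.1,
)

# max_length is enforced when encoding prompts, not in TrainingArguments.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")

def encode(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
```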