# Jailbreak Detection Model

## Model Description

This model is fine-tuned to detect jailbreak attempts in LLM prompts. It classifies prompts as either BENIGN or JAILBREAK.

- **Base Model:** microsoft/deberta-v3-small
- **Training Dataset:** jackhhao/jailbreak-classification
- **Training Date:** 2025-10-16

## Performance Metrics

- **Accuracy:** 0.9962
- **Precision:** 1.0000
- **Recall:** 0.9928
- **F1 Score:** 0.9964

## Training Details

- Learning Rate: 2e-05
- Batch Size: 16
- Epochs: 5
- Max Length: 512
- Class Weighting: True

## Usage

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="traromal/AIccel_Jailbreak")
result = classifier("Your prompt here")
print(result)
```

## Labels

- **BENIGN (0):** Safe, normal prompts
- **JAILBREAK (1):** Potential jailbreak attempts

## Label Mapping

- Original dataset labels: "benign" -> 0, "jailbreak" -> 1

## Limitations

- Model may not detect novel jailbreak techniques
- Performance depends on similarity to training data
- Should be used as part of a layered security approach

## Training Configuration

```json
{
  "learning_rate": 2e-05,
  "batch_size": 16,
  "num_epochs": 5,
  "max_length": 512,
  "weight_decay": 0.01,
  "warmup_ratio": 0.1
}
```
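The configuration above maps onto a standard Hugging Face `Trainer` setup. The sketch below shows one plausible way these hyperparameters could be wired together, assuming class weighting is applied through a weighted cross-entropy loss; the weight values, `output_dir`, and dataset handling are illustrative placeholders, not the exact script used to produce this model.

```python
# Illustrative sketch only: the exact training script is not part of this card.
import torch
from torch import nn
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-small", num_labels=2
)
# Prompts would be tokenized with truncation=True, max_length=512.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")

args = TrainingArguments(
    output_dir="jailbreak-detector",   # placeholder output directory
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_ratio=0.1,
)

class WeightedTrainer(Trainer):
    """Trainer variant that applies per-class weights to the cross-entropy loss."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Placeholder weights; real values would be derived from the
        # benign/jailbreak class counts in the training split.
        weights = torch.tensor([1.0, 1.0], device=outputs.logits.device)
        loss = nn.CrossEntropyLoss(weight=weights)(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```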