# Jailbreak Detection Model

## Model Description

This model is fine-tuned to detect jailbreak attempts in LLM prompts. It classifies prompts as either BENIGN or JAILBREAK.

**Base Model:** microsoft/deberta-v3-small
**Training Dataset:** jackhhao/jailbreak-classification
**Training Date:** 2025-10-16

## Performance Metrics

- **Accuracy:** 0.9962
- **Precision:** 1.0000
- **Recall:** 0.9928
- **F1 Score:** 0.9964

## Training Details

- Learning Rate: 2e-05
- Batch Size: 16
- Epochs: 5
- Max Length: 512
- Class Weighting: Enabled
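The card states that class weighting was enabled but not how the weights were computed. A common scheme, shown here purely as an assumption, is inverse-frequency ("balanced") weighting, where rarer classes receive proportionally larger loss weights:

```python
from collections import Counter

def compute_class_weights(labels):
    """Inverse-frequency ("balanced") weights: total / (num_classes * count).
    The minority class gets a proportionally larger weight in the loss."""
    counts = Counter(labels)
    total, num_classes = len(labels), len(counts)
    return {c: total / (num_classes * counts[c]) for c in sorted(counts)}

# Illustrative class balance (not the actual dataset statistics):
# 700 benign (0) examples vs. 300 jailbreak (1) examples
weights = compute_class_weights([0] * 700 + [1] * 300)
# weights[1] > weights[0]: the rarer jailbreak class is up-weighted
```

These weights would typically be passed to the loss function (e.g. a weighted cross-entropy) during fine-tuning.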
## Usage

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="traromal/AIccel_Jailbreak")
result = classifier("Your prompt here")
print(result)
```

## Labels

- **BENIGN (0):** Safe, normal prompts
- **JAILBREAK (1):** Potential jailbreak attempts

## Label Mapping

- Original dataset labels: "benign" -> 0, "jailbreak" -> 1
## Limitations

- The model may not detect novel jailbreak techniques
- Performance depends on how similar the input is to the training data
- It should be used as one layer in a layered security approach, not as the sole defense
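The layered-approach point can be sketched as follows. The regex patterns, the `classify` callable interface, and the threshold are all illustrative assumptions, not part of this model:

```python
import re

# Illustrative heuristic patterns -- an assumption, not shipped with the model
SUSPICIOUS_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"\bdeveloper mode\b",
]

def layered_check(prompt, classify, threshold=0.5):
    """Layer 1: cheap regex prefilter for obvious attempts.
    Layer 2: the classifier, called as
    classify(prompt) -> [{"label": ..., "score": ...}]."""
    if any(re.search(p, prompt, re.IGNORECASE) for p in SUSPICIOUS_PATTERNS):
        return "JAILBREAK"
    top = classify(prompt)[0]
    if top["label"] == "JAILBREAK" and top["score"] >= threshold:
        return "JAILBREAK"
    return "BENIGN"
```

Here the prefilter catches obvious patterns cheaply and the model handles everything else; neither layer alone is sufficient, which is the point of the limitation above.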
## Training Configuration

```json
{
  "learning_rate": 2e-05,
  "batch_size": 16,
  "num_epochs": 5,
  "max_length": 512,
  "weight_decay": 0.01,
  "warmup_ratio": 0.1
}
```