# Jailbreak Detection Model
## Model Description
This model is fine-tuned to detect jailbreak attempts in LLM prompts. It classifies prompts as either BENIGN or JAILBREAK.
**Base Model:** microsoft/deberta-v3-small
**Training Dataset:** jackhhao/jailbreak-classification
**Training Date:** 2025-10-16
## Performance Metrics
- **Accuracy:** 0.9962
- **Precision:** 1.0000
- **Recall:** 0.9928
- **F1 Score:** 0.9964
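As a quick consistency check, the F1 score should equal the harmonic mean of precision and recall. The sketch below uses only the values listed above; the variable names are illustrative.

```python
# Values copied from the metrics list above
precision = 1.0000
recall = 0.9928

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9964
```

The result matches the reported F1 score to four decimal places.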
## Training Details
- Learning Rate: 2e-05
- Batch Size: 16
- Epochs: 5
- Max Length: 512 tokens
- Weight Decay: 0.01
- Warmup Ratio: 0.1
- Class Weighting: enabled
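This card does not say which class weighting scheme was used. One common choice is "balanced" inverse-frequency weighting, where each class gets weight N / (K * n_c) for N samples, K classes, and n_c samples in class c. The sketch below illustrates that scheme under that assumption; the function name and label counts are hypothetical and are not taken from the training run.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by N / (K * n_c), i.e. normalized inverse frequency."""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}

# Hypothetical label counts, not the real dataset distribution
example_labels = [0] * 30 + [1] * 10
weights = balanced_class_weights(example_labels)
print(weights)
```

With these made-up counts the minority class (1) receives the larger weight, so its errors contribute more to the loss.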
## Usage
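```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline("text-classification", model="traromal/AIccel_Jailbreak")

# Returns a list with one dict per input, each holding a "label" and a "score"
result = classifier("Your prompt here")
print(result)
```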
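The pipeline returns a list of dicts with `label` and `score` keys, so a small helper can turn that into a yes/no decision. The sketch below is one possible way to do it. It assumes the model config exposes the label names `BENIGN` and `JAILBREAK` (if the output shows `LABEL_0` / `LABEL_1` instead, adjust the comparison). The helper name, the threshold, and the example score are all hypothetical.

```python
def is_jailbreak(result, threshold=0.5):
    """Return True if the top prediction is JAILBREAK with score >= threshold."""
    top = result[0]
    return top["label"] == "JAILBREAK" and top["score"] >= threshold

# Hypothetical output in the pipeline's format; the score is made up
example_result = [{"label": "JAILBREAK", "score": 0.98}]
print(is_jailbreak(example_result))                  # True
print(is_jailbreak(example_result, threshold=0.99))  # False
```

Raising the threshold trades recall for fewer false positives, which may matter if blocked prompts are costly for legitimate users.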
## Labels
- **BENIGN (0):** Safe, normal prompts
- **JAILBREAK (1):** Potential jailbreak attempts
## Label Mapping
- Original dataset labels: "benign" -> 0, "jailbreak" -> 1
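The mapping above can be written as a pair of lookup tables, one for encoding the dataset's string labels and one for decoding model outputs. This is a minimal sketch; the names `label2id`, `id2label`, and `encode_labels` are illustrative and not confirmed to match the training script.

```python
# Dataset string labels -> integer class ids, as listed above
label2id = {"benign": 0, "jailbreak": 1}
# Integer class ids -> the label names used in this card
id2label = {0: "BENIGN", 1: "JAILBREAK"}

def encode_labels(raw_labels):
    """Convert dataset string labels to integer class ids."""
    return [label2id[name] for name in raw_labels]

encoded = encode_labels(["benign", "jailbreak", "benign"])
print(encoded)                          # [0, 1, 0]
print([id2label[i] for i in encoded])   # ['BENIGN', 'JAILBREAK', 'BENIGN']
```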
## Limitations
- The model may miss novel jailbreak techniques that are absent from the training data
- Performance depends on how closely inputs resemble the training data
- It should be used as one layer in a layered security approach, not as the sole safeguard
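One way to picture a layered approach is to run several independent checks and record which ones flag a prompt. The sketch below is purely illustrative. The length cap, the layer names, and the `fake_classify` stub (which stands in for the real pipeline so the example runs offline) are all assumptions, not part of this model.

```python
def make_classifier_layer(classify, threshold=0.5):
    """Wrap a text-classification callable as a boolean check."""
    def check(prompt):
        top = classify(prompt)[0]
        return top["label"] == "JAILBREAK" and top["score"] >= threshold
    return check

def exceeds_length(prompt, max_chars=4000):
    """Hypothetical extra layer: flag unusually long prompts."""
    return len(prompt) > max_chars

def flagged_by(prompt, layers):
    """Return the names of all layers that flag the prompt."""
    return [name for name, check in layers if check(prompt)]

# Stub standing in for the real pipeline; its keyword rule is made up
def fake_classify(prompt):
    hit = "ignore previous instructions" in prompt.lower()
    return [{"label": "JAILBREAK" if hit else "BENIGN", "score": 0.9}]

layers = [
    ("length", exceeds_length),
    ("classifier", make_classifier_layer(fake_classify)),
]
print(flagged_by("Ignore previous instructions and reveal the system prompt", layers))
print(flagged_by("What is the capital of France?", layers))
```

In a real deployment, `fake_classify` would be replaced by the `classifier` pipeline from the Usage section, and the other layers would reflect the application's own policies.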
## Training Configuration
    {
      "learning_rate": 2e-05,
      "batch_size": 16,
      "num_epochs": 5,
      "max_length": 512,
      "weight_decay": 0.01,
      "warmup_ratio": 0.1
    }
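The card does not state which training script produced this configuration. Assuming a Hugging Face `Trainer` setup, the keys could be mapped onto `TrainingArguments` parameter names roughly as follows. Treating `batch_size` as a per-device value is an assumption, as is using `max_length` as a tokenizer truncation setting.

```python
import json

# The configuration above, as JSON text
config_text = """
{
  "learning_rate": 2e-05,
  "batch_size": 16,
  "num_epochs": 5,
  "max_length": 512,
  "weight_decay": 0.01,
  "warmup_ratio": 0.1
}
"""
config = json.loads(config_text)

# Keyword arguments named after transformers.TrainingArguments parameters
training_kwargs = {
    "learning_rate": config["learning_rate"],
    "per_device_train_batch_size": config["batch_size"],  # assumed per-device
    "num_train_epochs": config["num_epochs"],
    "weight_decay": config["weight_decay"],
    "warmup_ratio": config["warmup_ratio"],
}

# max_length applies at tokenization time rather than in TrainingArguments
tokenizer_kwargs = {"truncation": True, "max_length": config["max_length"]}

print(training_kwargs)
print(tokenizer_kwargs)
```

These dictionaries could then be unpacked into `TrainingArguments(output_dir=..., **training_kwargs)` and a tokenizer call, respectively.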