# Jailbreak Detection Model
## Model Description
This model is fine-tuned to detect jailbreak attempts in LLM prompts. It classifies prompts as either BENIGN or JAILBREAK.
**Base Model:** microsoft/deberta-v3-small
**Training Dataset:** jackhhao/jailbreak-classification
**Training Date:** 2025-10-16
## Performance Metrics
- **Accuracy:** 0.9962
- **Precision:** 1.0000
- **Recall:** 0.9928
- **F1 Score:** 0.9964
## Training Details
- Learning Rate: 2e-05
- Batch Size: 16
- Epochs: 5
- Max Length: 512
- Class Weighting: True
## Usage
```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline("text-classification", model="traromal/AIccel_Jailbreak")

# Truncate long inputs to the 512-token length used during training
result = classifier("Your prompt here", truncation=True, max_length=512)
print(result)  # e.g. [{'label': 'BENIGN', 'score': 0.99}]
```
## Labels
- **BENIGN (0):** Safe, normal prompts
- **JAILBREAK (1):** Potential jailbreak attempts
## Label Mapping
- Original dataset labels: "benign" -> 0, "jailbreak" -> 1
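The mapping above can be expressed as a small helper for converting pipeline output back to the dataset's integer labels. The label names follow this card; the helper name itself is illustrative:

```python
# Integer <-> string label mapping as documented on this card
ID2LABEL = {0: "BENIGN", 1: "JAILBREAK"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

def to_dataset_label(pipeline_result: dict) -> int:
    """Map a pipeline result like {'label': 'JAILBREAK', 'score': 0.99}
    to the original dataset's integer label (0 or 1)."""
    return LABEL2ID[pipeline_result["label"]]
```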
## Limitations
- Model may not detect novel jailbreak techniques
- Performance depends on similarity to training data
- Should be used as part of a layered security approach
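In a layered setup, one option is to use the classifier as a pre-filter that only blocks high-confidence detections and defers everything else to downstream safeguards. The threshold of 0.9 below is an assumption, not a value from this card, and should be tuned on your own traffic:

```python
def is_allowed(result: dict, threshold: float = 0.9) -> bool:
    """Return False only when the classifier flags JAILBREAK with high
    confidence; ambiguous prompts pass through to later security layers.

    `result` is a single pipeline output, e.g.
    {'label': 'JAILBREAK', 'score': 0.97}. The 0.9 threshold is an
    illustrative assumption."""
    return not (result["label"] == "JAILBREAK" and result["score"] >= threshold)
```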
## Training Configuration
```json
{
  "learning_rate": 2e-05,
  "batch_size": 16,
  "num_epochs": 5,
  "max_length": 512,
  "weight_decay": 0.01,
  "warmup_ratio": 0.1
}
```
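For reference, `warmup_ratio` scales with the total number of optimizer steps, which depends on dataset size. The sketch below reconstructs that arithmetic from the configuration above; the example dataset size is an assumption, since this card does not state it:

```python
config = {
    "learning_rate": 2e-05,
    "batch_size": 16,
    "num_epochs": 5,
    "max_length": 512,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
}

def warmup_steps(num_examples: int, cfg: dict) -> int:
    """Warmup steps = warmup_ratio * total optimizer steps."""
    steps_per_epoch = -(-num_examples // cfg["batch_size"])  # ceiling division
    total_steps = steps_per_epoch * cfg["num_epochs"]
    return int(total_steps * cfg["warmup_ratio"])
```

For example, a hypothetical 1,600-example training split would give 100 steps per epoch, 500 total steps, and 50 warmup steps.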