File size: 1,242 Bytes


# Jailbreak Detection Model

## Model Description
This model is fine-tuned to detect jailbreak attempts in LLM prompts. It classifies prompts as either BENIGN or JAILBREAK.

**Base Model:** microsoft/deberta-v3-small
**Training Dataset:** jackhhao/jailbreak-classification
**Training Date:** 2025-10-16

## Performance Metrics
- **Accuracy:** 0.9962
- **Precision:** 1.0000
- **Recall:** 0.9928
- **F1 Score:** 0.9964

## Training Details
- Learning Rate: 2e-05
- Batch Size: 16
- Epochs: 5
- Max Length: 512
- Class Weighting: True

## Usage
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="traromal/AIccel_Jailbreak")
result = classifier("Your prompt here")
print(result)
```

## Labels
- **BENIGN (0):** Safe, normal prompts
- **JAILBREAK (1):** Potential jailbreak attempts

## Label Mapping
- Original dataset labels: "benign" -> 0, "jailbreak" -> 1

## Limitations
- Model may not detect novel jailbreak techniques
- Performance depends on similarity to training data
- Should be used as part of a layered security approach

## Training Configuration
{
  "learning_rate": 2e-05,
  "batch_size": 16,
  "num_epochs": 5,
  "max_length": 512,
  "weight_decay": 0.01,
  "warmup_ratio": 0.1
}