# Jailbreak Detection Model πŸš€

This model is fine-tuned to classify whether a prompt is a jailbreak attempt or an otherwise unsafe instruction.

## πŸ“Š Training Metrics
- **Training steps:** 0
- **Final Training Loss:** N/A
- **Final Eval Loss:** 0.07551019638776779

## πŸ“ˆ Training Curve
![Training Curve](./training_loss.png)

## πŸ›  How to Use
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("hurtmongoose/Jailbreak-Detection-Models")
tokenizer = AutoTokenizer.from_pretrained("hurtmongoose/Jailbreak-Detection-Models")

# Classify a sample prompt
inputs = tokenizer("This is a test jailbreak prompt", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits)                  # raw class scores
print(outputs.logits.argmax(dim=-1))   # predicted class index
```

## πŸ“Œ Notes

- Trained on a jailbreak detection dataset
- Can be improved with more adversarial prompts
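To interpret the raw logits, they can be converted to probabilities with a softmax. Below is a minimal sketch in plain Python; the example logit values and the `["safe", "jailbreak"]` label order are illustrative assumptions (check `model.config.id2label` for the actual mapping), not values taken from this model:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one prompt; in practice these come from outputs.logits
logits = [-1.2, 2.3]
probs = softmax(logits)

labels = ["safe", "jailbreak"]  # assumed label order; verify via model.config.id2label
pred = labels[probs.index(max(probs))]
print(pred, round(max(probs), 3))
```

The same decision can also be obtained directly with `outputs.logits.argmax(dim=-1)` as shown above; the softmax is only needed when you want a calibrated-looking confidence score.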