Jailbreak Prediction Model: llama2:7b (fine-tuned on harmful_behaviors_30)
Incrementally fine-tuned from yonad2008/predict_llama2_7b
on data from harmful_behaviors_30.csv.
Evaluation Results
| Metric | Value |
|---|---|
| F1 | 0.3125 |
| PR-AUC | 0.2708 |
| ROC-AUC | 0.6016 |
| Precision | 0.2000 |
| Recall | 0.7143 |
| Best Threshold | 0.50 |
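As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, so it can be recomputed from those two rows. The sketch below is illustrative only: the helper name `f1_from_precision_recall` is hypothetical and not part of this model's code.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation table.
precision = 0.2000
recall = 0.7143

f1 = f1_from_precision_recall(precision, recall)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.3125"
```

The recomputed value agrees with the reported F1 of 0.3125. The combination of high recall and low precision means that, at the 0.50 threshold, the model flags most unsafe cases but also produces many false positives.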
Fine-Tuning Details
- Base model: yonad2008/predict_llama2_7b
- Target model: llama2:7b
- Dataset: harmful_behaviors_30.csv
- Frozen layers: 9/12 transformer layers
- Epochs: 3
- Learning Rate: 5e-06
- Max Length: 512
- Input format: turns only
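The "9/12 frozen layers" setting can be sketched as a rule over parameter names. This is a minimal illustration, not the author's training code: it assumes BERT-style parameter names containing `.layer.<index>.`, assumes the frozen layers are the first nine, and leaves everything outside the layer stack (embeddings, classifier head) trainable. The card confirms none of these details, and the helper `is_frozen` is hypothetical.

```python
import re

# Assumption: layer parameters are named like
# "encoder.layer.3.attention.self.query.weight" (BERT-style naming).
_LAYER_INDEX = re.compile(r"\.layer\.(\d+)\.")


def is_frozen(param_name: str, n_frozen: int = 9) -> bool:
    """Return True if the parameter sits in one of the first n_frozen layers."""
    match = _LAYER_INDEX.search(param_name)
    if match is None:
        # Not inside the layer stack: kept trainable in this sketch.
        return False
    return int(match.group(1)) < n_frozen


# Toy parameter names for a 12-layer encoder plus a classifier head.
names = [f"encoder.layer.{i}.output.dense.weight" for i in range(12)]
names.append("classifier.weight")

frozen = [n for n in names if is_frozen(n)]
trainable = [n for n in names if not is_frozen(n)]
print(len(frozen), len(trainable))  # prints "9 4"
```

With a PyTorch model, the same rule could be applied by looping over `model.named_parameters()` and setting `param.requires_grad = not is_frozen(name)` before building the optimizer.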
Dataset Size (before turn expansion)
Original rows (after cleaning and balancing): 108 (unsafe: 27, safe: 81)
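The class balance implied by these counts can be worked out directly. The snippet below only restates the arithmetic on the reported numbers. Whether the evaluation split, or the data after turn expansion, keeps the same ratio is not stated in this card.

```python
# Row counts reported above (before turn expansion).
unsafe, safe = 27, 81

total = unsafe + safe
unsafe_fraction = unsafe / total   # share of positive (unsafe) rows
safe_to_unsafe = safe / unsafe     # negatives per positive

print(total, unsafe_fraction, safe_to_unsafe)  # prints "108 0.25 3.0"
```

So the balanced set contains one unsafe row for every three safe rows.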