---
language: en
tags:
- jailbreak-detection
- deberta-v3
- text-classification
model-index:
- name: predict_llama2_7b
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.9120
    - name: PR-AUC
      type: pr_auc
      value: 0.9434
    - name: ROC-AUC
      type: roc_auc
      value: 0.9705
    - name: Precision
      type: precision
      value: 0.9048
    - name: Recall
      type: recall
      value: 0.9194
---

# Jailbreak Prediction Model: llama2:7b

Fine-tuned DeBERTa-v3-base for detecting unsafe/jailbreak prompts in multi-turn conversations.

## Evaluation Results (best fold: 2)

| Metric         | Value  |
|----------------|--------|
| F1             | 0.9120 |
| PR-AUC         | 0.9434 |
| ROC-AUC        | 0.9705 |
| Precision      | 0.9048 |
| Recall         | 0.9194 |
| Best Threshold | 0.35   |

## Training Details

- **Base model**: `microsoft/deberta-v3-base`
- **Target model**: `llama2:7b`
- **Datasets**: HarmBench
- **K-Folds**: 5
- **Epochs**: 5
- **Learning Rate**: 2e-05
- **Max Length**: 512
- **Input format**: turns only
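
The reported best threshold of 0.35 is lower than the usual 0.5 cutoff, so a prompt can be flagged even when the unsafe class is not the most likely one. The following is a minimal sketch of how that threshold could be applied to the classifier's raw logits. It assumes a two-class head in which index 1 is the unsafe/jailbreak class; the card does not state the label order, so verify it against the model's `id2label` config before relying on this. The function names are hypothetical and not part of the released model.

```python
import math

# Best threshold reported in the evaluation table of this card.
BEST_THRESHOLD = 0.35

# ASSUMPTION: index 1 of the logits is the unsafe/jailbreak class.
UNSAFE_INDEX = 1


def unsafe_probability(logits):
    """Return the softmax probability of the (assumed) unsafe class."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[UNSAFE_INDEX] / sum(exps)


def is_jailbreak(logits, threshold=BEST_THRESHOLD):
    """Flag the input when the unsafe probability reaches the threshold."""
    return unsafe_probability(logits) >= threshold


if __name__ == "__main__":
    # Unsafe probability is about 0.378: below 0.5, but above 0.35.
    example_logits = [0.5, 0.0]
    print(round(unsafe_probability(example_logits), 3), is_jailbreak(example_logits))
```

In practice the logits would come from a sequence-classification forward pass over the conversation turns, truncated to the 512-token maximum length listed under Training Details. How the turns are joined into a single input is not specified in this card, so the sketch deliberately starts from the logits.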