---
language: en
tags:
- jailbreak-detection
- deberta-v3
- text-classification
model-index:
- name: predict_qwen2.5_7b-instruct
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.7731
    - name: PR-AUC
      type: pr_auc
      value: 0.8692
    - name: ROC-AUC
      type: roc_auc
      value: 0.9591
    - name: Precision
      type: precision
      value: 0.8364
    - name: Recall
      type: recall
      value: 0.7188
---

# Jailbreak Prediction Model: qwen2.5:7b-instruct

Fine-tuned DeBERTa-v3-base for detecting unsafe/jailbreak prompts in multi-turn conversations.

## Evaluation Results (best fold: 3)

| Metric         | Value  |
|----------------|--------|
| F1             | 0.7731 |
| PR-AUC         | 0.8692 |
| ROC-AUC        | 0.9591 |
| Precision      | 0.8364 |
| Recall         | 0.7188 |
| Best Threshold | 0.10   |

## Training Details

- **Base model**: `microsoft/deberta-v3-base`
- **Target model**: `qwen2.5:7b-instruct`
- **Datasets**: HarmBench
- **K-Folds**: 5
- **Epochs**: 5
- **Learning Rate**: 2e-05
- **Max Length**: 512
- **Input format**: turns only

## Dataset Size (before turn expansion)

Original rows (after cleaning and balancing): 1536 (unsafe: 308, safe: 1228)
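
The evaluation results report a best threshold of 0.10 rather than the usual 0.5, so predictions should probably be thresholded on the unsafe-class probability instead of taken by argmax. The following is a minimal sketch of that step in plain Python. It assumes a two-class head whose logits are ordered `[safe, unsafe]`; that ordering, along with the helper names `unsafe_probability`, `is_unsafe`, and `f1_score`, is hypothetical and should be checked against the model's actual label mapping. The last helper shows how the reported F1 follows from the reported precision and recall.

```python
import math

# "Best Threshold" from the evaluation results table.
BEST_THRESHOLD = 0.10


def unsafe_probability(logits, unsafe_index=1):
    """Softmax over the class logits; return the unsafe-class probability.

    unsafe_index=1 is an assumption; verify it against the model's
    id2label mapping before relying on it.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[unsafe_index] / sum(exps)


def is_unsafe(logits, threshold=BEST_THRESHOLD):
    """Flag a prompt when its unsafe probability reaches the threshold."""
    return unsafe_probability(logits) >= threshold


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Argmax would call this example safe, but p(unsafe) is about 0.119,
    # which is above the 0.10 threshold.
    print(is_unsafe([2.0, 0.0]))
    # Here p(unsafe) is about 0.047, below the threshold.
    print(is_unsafe([3.0, 0.0]))
    # Agrees with the reported F1 of 0.7731 up to rounding of the inputs.
    print(round(f1_score(0.8364, 0.7188), 3))
```

A threshold below 0.5 flags more prompts than argmax would, which generally trades some precision for higher recall.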