---
language: en
tags:
- jailbreak-detection
- deberta-v3
- text-classification
model-index:
- name: predict_llama3.2_3b
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.7216
    - name: PR-AUC
      type: pr_auc
      value: 0.7712
    - name: ROC-AUC
      type: roc_auc
      value: 0.9199
    - name: Precision
      type: precision
      value: 0.6306
    - name: Recall
      type: recall
      value: 0.8434
---

# Jailbreak Prediction Model: llama3.2:3b

Fine-tuned DeBERTa-v3-base for detecting unsafe/jailbreak prompts in multi-turn conversations.

## Evaluation Results (best fold: 1)

| Metric         | Value  |
|----------------|--------|
| F1             | 0.7216 |
| PR-AUC         | 0.7712 |
| ROC-AUC        | 0.9199 |
| Precision      | 0.6306 |
| Recall         | 0.8434 |
| Best Threshold | 0.20   |

## Training Details

- **Base model**: `microsoft/deberta-v3-base`
- **Target model**: `llama3.2:3b`
- **Datasets**: HarmBench
- **K-Folds**: 5
- **Epochs**: 5
- **Learning Rate**: 2e-05
- **Max Length**: 512
- **Input format**: turns only

## Dataset Size (before turn expansion)

Original rows (after cleaning and balancing): 2096 (unsafe: 401, safe: 1695)
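
## Applying the Threshold (sketch)

The reported best threshold of 0.20 is well below the 0.5 cut-off implied by a plain argmax, so it has to be applied explicitly to the unsafe-class probability. The following is a minimal sketch of one way to do that, not the authors' inference code. It assumes a two-class head where index 1 is the unsafe class and that scores at or above the threshold count as unsafe; neither detail is stated in this card. The helper names `unsafe_probability`, `is_unsafe`, and `f1_from_precision_recall` are hypothetical. The last helper only checks that the reported F1 is consistent with the reported precision and recall.

```python
import math

# "Best Threshold" from the evaluation table in this card.
BEST_THRESHOLD = 0.20

# ASSUMPTION: index 1 is the unsafe class; the card does not state the label order.
UNSAFE_INDEX = 1


def unsafe_probability(logits, unsafe_index=UNSAFE_INDEX):
    """Return the softmax probability of the unsafe class for one example."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[unsafe_index] / sum(exps)


def is_unsafe(logits, threshold=BEST_THRESHOLD):
    """Flag an example as unsafe when its probability meets the threshold.

    ASSUMPTION: the comparison is inclusive (>=); the card does not say.
    """
    return unsafe_probability(logits) >= threshold


def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Logits [1.0, 0.0] give an unsafe probability of about 0.27:
    # argmax would call this safe, the 0.20 threshold flags it.
    print(is_unsafe([1.0, 0.0]))
    # Consistency check against the reported precision and recall.
    print(round(f1_from_precision_recall(0.6306, 0.8434), 4))
```

In practice the logits would come from the fine-tuned classifier's output for each input. A low threshold of this kind trades precision for recall, which matches the reported values (recall 0.8434, precision 0.6306).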