---
language: en
tags:
- jailbreak-detection
- deberta-v3
- text-classification
model-index:
- name: predict_zephyr_7b-beta
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.8592
    - name: PR-AUC
      type: pr_auc
      value: 0.9080
    - name: ROC-AUC
      type: roc_auc
      value: 0.9757
    - name: Precision
      type: precision
      value: 0.8841
    - name: Recall
      type: recall
      value: 0.8356
---

# Jailbreak Prediction Model: zephyr:7b-beta

Fine-tuned DeBERTa-v3-base for detecting unsafe/jailbreak prompts in multi-turn conversations.

## Evaluation Results (best fold: 3)

| Metric         | Value  |
|----------------|--------|
| F1             | 0.8592 |
| PR-AUC         | 0.9080 |
| ROC-AUC        | 0.9757 |
| Precision      | 0.8841 |
| Recall         | 0.8356 |
| Best Threshold | 0.10   |
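
The reported F1 can be cross-checked from the precision and recall in the table, and the best threshold of 0.10 can be read as the cut-off applied to the predicted unsafe probability. The following is a minimal sketch, assuming the classifier emits a probability for the unsafe class and that a prompt is flagged when that probability is at or above the threshold (the card does not say whether the comparison is inclusive). `f1_from` and `flag_unsafe` are hypothetical helper names, not part of the model's code.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def flag_unsafe(prob_unsafe: float, threshold: float = 0.10) -> bool:
    """Flag a prompt when its unsafe probability reaches the threshold.

    Assumption: the comparison is inclusive (>=); the card does not specify.
    """
    return prob_unsafe >= threshold


# Precision and recall taken from the evaluation table.
recomputed_f1 = round(f1_from(0.8841, 0.8356), 4)
print(recomputed_f1)

# Illustrative probabilities only, not model outputs.
print(flag_unsafe(0.07), flag_unsafe(0.35))
```

A threshold well below 0.5 flags more prompts as unsafe, which generally raises recall at the cost of precision.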

## Training Details

- **Base model**: `microsoft/deberta-v3-base`
- **Target model**: `zephyr:7b-beta`
- **Datasets**: HarmBench
- **K-Folds**: 5
- **Epochs**: 5
- **Learning Rate**: 2e-05
- **Max Length**: 512
- **Input format**: turns only

## Dataset Size (before turn expansion)

Original rows (after cleaning and balancing): 1730 (unsafe: 336, safe: 1394)
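
To make the class balance concrete, the sketch below computes the unsafe share of the rows and the size of each held-out fold under a 5-fold split. It assumes the folds are stratified by label and built on the rows before turn expansion; the card states neither, so treat the numbers as illustrative. `stratified_fold_sizes` is a hypothetical helper, not the training script's code.

```python
from typing import List, Tuple


def stratified_fold_sizes(n_pos: int, n_neg: int, k: int) -> List[Tuple[int, int]]:
    """Return (positives, negatives) per held-out fold.

    Each class is divided as evenly as possible; leftover rows go to the
    first folds. Assumption: stratified splitting, which the card does not state.
    """
    def split(n: int) -> List[int]:
        base, extra = divmod(n, k)
        return [base + (1 if i < extra else 0) for i in range(k)]

    return list(zip(split(n_pos), split(n_neg)))


# Row counts from the card (before turn expansion) and its K-Folds setting.
UNSAFE, SAFE, K = 336, 1394, 5

unsafe_share = UNSAFE / (UNSAFE + SAFE)
folds = stratified_fold_sizes(UNSAFE, SAFE, K)

print(f"unsafe share: {unsafe_share:.1%}")
for i, (pos, neg) in enumerate(folds, start=1):
    print(f"fold {i}: {pos} unsafe, {neg} safe")
```

With roughly one unsafe row in five, a classifier that ranks prompts at random would score a PR-AUC near that share on similarly balanced data, which gives a rough baseline for reading the reported PR-AUC.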