---
language: en
tags:
- jailbreak-detection
- deberta-v3
- text-classification
- fine-tuned
model-index:
- name: predict_zephyr_7b-beta_harmful_behaviors_1
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.2699
    - name: PR-AUC
      type: pr_auc
      value: 0.3947
    - name: ROC-AUC
      type: roc_auc
      value: 0.6368
    - name: Precision
      type: precision
      value: 0.1618
    - name: Recall
      type: recall
      value: 0.8148
---
Jailbreak Prediction Model: zephyr:7b-beta (fine-tuned on harmful_behaviors_1)
This classifier was incrementally fine-tuned from yonad2008/predict_zephyr_7b-beta on data from harmful_behaviors_1.csv.
Evaluation Results
| Metric | Value |
|---|---|
| F1 | 0.2699 |
| PR-AUC | 0.3947 |
| ROC-AUC | 0.6368 |
| Precision | 0.1618 |
| Recall | 0.8148 |
| Best Threshold | 0.35 |
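The table's F1 can be cross-checked against its precision and recall, and the best threshold can be turned into a decision rule. The sketch below is illustrative only: the function names are hypothetical, and it assumes that class 1 is the positive class and that the threshold comparison is inclusive, neither of which is stated in this card.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation table.
PRECISION = 0.1618
RECALL = 0.8148
REPORTED_F1 = 0.2699
BEST_THRESHOLD = 0.35


def predict_label(p_positive: float, threshold: float = BEST_THRESHOLD) -> int:
    """Return 1 when the positive-class probability reaches the threshold.

    Assumption: class 1 is the positive class and the comparison is inclusive.
    """
    return int(p_positive >= threshold)


recomputed_f1 = f1_from_precision_recall(PRECISION, RECALL)
print(f"recomputed F1 = {recomputed_f1:.4f}, reported F1 = {REPORTED_F1:.4f}")
print(predict_label(0.40), predict_label(0.20))
```

The recomputed value agrees with the reported F1 to within the rounding of the four-decimal precision and recall. A threshold below 0.5 trades precision for recall, which matches the high-recall, low-precision pattern in the table.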
Fine-Tuning Details
- Base model: yonad2008/predict_zephyr_7b-beta
- Target model: zephyr:7b-beta
- Dataset: harmful_behaviors_1.csv
- Frozen layers: 9/12 transformer layers
- Epochs: 3
- Learning Rate: 5e-06
- Max Length: 512
- Input format: turns only
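The "9/12 frozen layers" setting can be sketched as follows. This is a minimal illustration, not the training script behind this model: `freeze_lower_layers` is a hypothetical helper, and it assumes that the frozen layers are the nine lowest encoder layers, that parameter names contain `encoder.layer.<index>.` (as in the Hugging Face DeBERTa implementation), and that embeddings and the classifier head stay trainable. The demo uses stand-in parameter objects so it runs without downloading a model.

```python
import re
from types import SimpleNamespace

# Matches the layer index in names such as
# "deberta.encoder.layer.3.attention.self.query_proj.weight".
LAYER_PATTERN = re.compile(r"encoder\.layer\.(\d+)\.")


def freeze_lower_layers(named_params, num_frozen=9):
    """Disable gradients for encoder layers with index < num_frozen.

    named_params: iterable of (name, param) pairs, e.g. model.named_parameters().
    Returns the names of the parameters that were frozen.
    """
    frozen = []
    for name, param in named_params:
        match = LAYER_PATTERN.search(name)
        if match and int(match.group(1)) < num_frozen:
            param.requires_grad = False
            frozen.append(name)
    return frozen


# Stand-in parameters for a 12-layer encoder plus a classifier head.
demo_params = [
    (f"deberta.encoder.layer.{i}.attention.self.query_proj.weight",
     SimpleNamespace(requires_grad=True))
    for i in range(12)
]
demo_params.append(("classifier.weight", SimpleNamespace(requires_grad=True)))

frozen_names = freeze_lower_layers(demo_params, num_frozen=9)
trainable_names = [name for name, p in demo_params if p.requires_grad]
print(len(frozen_names), "frozen;", len(trainable_names), "trainable")
```

With a real PyTorch model, the same helper would be called as `freeze_lower_layers(model.named_parameters())` before building the optimizer, so that only the top three layers and the head receive updates at the listed learning rate of 5e-06.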
Dataset Size (before turn expansion)
Original rows (after cleaning and balancing): 296 (unsafe: 74, safe: 222)
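As a quick sanity check on the counts above, the class proportions can be computed directly. The variable names are illustrative, and the figures describe rows before turn expansion, so the number of training examples seen by the model may differ.

```python
# Row counts copied from the dataset-size line (before turn expansion).
UNSAFE_ROWS = 74
SAFE_ROWS = 222
TOTAL_ROWS = 296

# The two classes should account for every row.
assert UNSAFE_ROWS + SAFE_ROWS == TOTAL_ROWS

unsafe_fraction = UNSAFE_ROWS / TOTAL_ROWS
safe_to_unsafe_ratio = SAFE_ROWS / UNSAFE_ROWS
print(f"unsafe fraction: {unsafe_fraction:.2%}")
print(f"safe rows per unsafe row: {safe_to_unsafe_ratio:.1f}")
```

The counts are consistent with each other and correspond to a 1:3 unsafe-to-safe ratio, so "balancing" here means a fixed ratio rather than equal class sizes.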