GNN Jailbreak Prediction Model (phi4:14b)
A homogeneous GNN classifier that estimates unsafe/jailbreak likelihood in multi-turn conversations.
Evaluation Results
| Metric | Value |
|---|---|
| F1 | 0.9411 |
| PR-AUC | 0.9782 |
| ROC-AUC | 0.9593 |
| Precision | 0.9682 |
| Recall | 0.9163 |
| Best Threshold | 0.270 |
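
The best threshold in the table is the score cutoff used to turn the model's predicted likelihood into a binary unsafe/safe label. The following is a minimal sketch of how such a cutoff could be applied and scored, not the evaluation code behind the numbers above. The function names `classify` and `precision_recall_f1` are hypothetical, and treating a score equal to the threshold as unsafe (`>=`) is an assumption.

```python
from typing import List, Sequence, Tuple

BEST_THRESHOLD = 0.270  # "Best Threshold" from the evaluation table


def classify(scores: Sequence[float], threshold: float = BEST_THRESHOLD) -> List[int]:
    """Label a sample unsafe (1) when its score reaches the threshold.

    Using >= rather than > is an assumption; the model card does not say.
    """
    return [1 if s >= threshold else 0 for s in scores]


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive (unsafe) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy scores and labels for illustration only; not taken from the datasets.
    toy_scores = [0.05, 0.30, 0.27, 0.90, 0.10]
    toy_labels = [0, 1, 0, 1, 1]
    toy_preds = classify(toy_scores)
    print(toy_preds, precision_recall_f1(toy_labels, toy_preds))
```

A threshold well below 0.5, as reported here, trades some precision for recall, so it is worth re-tuning if the class balance of the deployment data differs from the evaluation set.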
Training Details
- Target model: phi4:14b
- Datasets: harmbench, harmful_behaviors_1
- Split column: goal
- Seed: 42
- Sentence model: sentence-transformers/all-MiniLM-L6-v2
- Hidden channels: 128
- Num layers: 2
- Dropout: 0.3
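
A split column of goal together with a fixed seed suggests a group-wise split, where all turn-level samples that share a goal land on the same side of the train/test boundary so that no goal leaks across it. The sketch below illustrates that idea under stated assumptions: the function name `split_by_goal`, the dictionary sample layout, and the `test_fraction` default of 0.2 are all hypothetical, since the model card does not give the split ratio or the data schema.

```python
import random
from typing import Dict, List, Sequence, Tuple

SEED = 42  # "Seed" from the training details


def split_by_goal(
    samples: Sequence[Dict],
    test_fraction: float = 0.2,  # assumed value; not stated in the model card
    seed: int = SEED,
) -> Tuple[List[Dict], List[Dict]]:
    """Split samples so that each distinct `goal` appears in only one subset."""
    # Sort the unique goals first so the shuffle is reproducible across runs.
    goals = sorted({s["goal"] for s in samples})
    random.Random(seed).shuffle(goals)

    # Hold out at least one goal for testing.
    n_test = max(1, round(len(goals) * test_fraction))
    test_goals = set(goals[:n_test])

    train = [s for s in samples if s["goal"] not in test_goals]
    test = [s for s in samples if s["goal"] in test_goals]
    return train, test


if __name__ == "__main__":
    # Toy samples for illustration only: 5 goals with 4 turns each.
    toy = [{"goal": f"g{i % 5}", "turn": i} for i in range(20)]
    train, test = split_by_goal(toy)
    print(len(train), len(test))
```

Splitting on goals rather than on individual turns keeps near-duplicate turns from the same conversation out of both subsets at once, which would otherwise inflate the reported metrics.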
Dataset Size (training samples)
Prepared turn-level samples: 707