---
language: en
tags:
- gnn
- jailbreak-detection
- text-classification
model-index:
- name: predict_gnn_phi4_14b
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.9411
    - name: PR-AUC
      type: pr_auc
      value: 0.9782
    - name: ROC-AUC
      type: roc_auc
      value: 0.9593
    - name: Precision
      type: precision
      value: 0.9682
    - name: Recall
      type: recall
      value: 0.9163
---

# GNN Jailbreak Prediction Model (phi4:14b)

A homogeneous GNN classifier that scores multi-turn conversations by how likely they are to be unsafe or a jailbreak.

## Evaluation Results

| Metric         | Value  |
|----------------|--------|
| F1             | 0.9411 |
| PR-AUC         | 0.9782 |
| ROC-AUC        | 0.9593 |
| Precision      | 0.9682 |
| Recall         | 0.9163 |
| Best Threshold | 0.270  |

## Training Details

- **Target model**: `phi4:14b`
- **Datasets**: harmbench, harmful_behaviors_1
- **Split column**: `goal`
- **Seed**: `42`
- **Sentence embedding model**: `sentence-transformers/all-MiniLM-L6-v2`
- **Hidden channels**: `128`
- **Number of layers**: `2`
- **Dropout**: `0.3`

## Dataset Size (training samples)

Prepared turn-level samples: 707
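The card reports a best decision threshold of 0.270 but does not say how it was chosen or how scores should be turned into labels. The sketch below is a minimal, dependency-free illustration that rests on a few assumptions. It assumes the classifier emits one probability in [0, 1] per sample. It assumes a score greater than or equal to the threshold counts as unsafe. It also treats "best" as the F1-maximising cut-off from a sweep. All function names are hypothetical and are not part of this model's code.

```python
"""Hypothetical sketch: converting GNN scores into unsafe/jailbreak labels.

Assumptions (not confirmed by the model card):
- the model outputs one probability in [0, 1] per sample;
- a score >= threshold is labelled positive (1 = unsafe/jailbreak);
- "best threshold" means the F1-maximising cut-off over a sweep.
"""
from typing import List, Sequence, Tuple

BEST_THRESHOLD = 0.270  # "Best Threshold" from the Evaluation Results table


def apply_threshold(probs: Sequence[float], threshold: float = BEST_THRESHOLD) -> List[int]:
    """Label each sample 1 if its score reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probs]


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Binary precision, recall and F1, treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_f1_threshold(
    y_true: Sequence[int], probs: Sequence[float], candidates: Sequence[float]
) -> Tuple[float, float]:
    """Return (threshold, F1) for the candidate cut-off with the highest F1.

    Ties keep the earliest candidate. This selection rule is an assumption.
    """
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        _, _, f1 = precision_recall_f1(y_true, apply_threshold(probs, t))
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Consider the made-up scores `[0.91, 0.45, 0.22, 0.30, 0.10, 0.05]`. With the default threshold, `apply_threshold` returns `[1, 1, 0, 1, 0, 0]`. If the true labels are `[1, 1, 1, 0, 0, 0]`, precision, recall and F1 are each 2/3. The card's own metrics come from its held-out evaluation and cannot be reproduced from this toy data.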