---
language: en
tags:
- gnn
- jailbreak-detection
- text-classification
model-index:
- name: predict_gnn_phi4_14b
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.9411
    - name: PR-AUC
      type: pr_auc
      value: 0.9782
    - name: ROC-AUC
      type: roc_auc
      value: 0.9593
    - name: Precision
      type: precision
      value: 0.9682
    - name: Recall
      type: recall
      value: 0.9163
---
# GNN Jailbreak Prediction Model (phi4:14b)

A homogeneous GNN classifier that scores the likelihood of unsafe or jailbreak behavior in multi-turn conversations.
## Evaluation Results

| Metric         | Value  |
|----------------|--------|
| F1             | 0.9411 |
| PR-AUC         | 0.9782 |
| ROC-AUC        | 0.9593 |
| Precision      | 0.9682 |
| Recall         | 0.9163 |
| Best Threshold | 0.270  |
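
The Best Threshold in the table can be applied to the classifier's output scores to obtain a binary decision. The following is a minimal sketch, not the project's own inference code: the function name `flag_unsafe` is hypothetical, and it assumes that the model emits a probability in [0, 1] per sample and that scores at or above the threshold count as unsafe.

```python
from typing import List, Sequence

# Threshold reported in the evaluation table.
BEST_THRESHOLD = 0.270


def flag_unsafe(probabilities: Sequence[float],
                threshold: float = BEST_THRESHOLD) -> List[bool]:
    """Return True for every score at or above the threshold.

    Assumption: higher scores mean "more likely unsafe/jailbreak", and the
    comparison is inclusive. Neither detail is confirmed by the model card.
    """
    return [p >= threshold for p in probabilities]


if __name__ == "__main__":
    scores = [0.05, 0.27, 0.81]  # illustrative values, not real model output
    print(flag_unsafe(scores))
```

A threshold this far below 0.5 flags more samples than the default cut-off would, so it favors recall relative to a 0.5 threshold on the same scores.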
## Training Details

- **Target model**: `phi4:14b`
- **Datasets**: `harmbench`, `harmful_behaviors_1`
- **Split column**: `goal`
- **Seed**: `42`
- **Sentence model**: `sentence-transformers/all-MiniLM-L6-v2`
- **Hidden channels**: `128`
- **Num layers**: `2`
- **Dropout**: `0.3`
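
One plausible reading of the split column setting is that samples sharing the same `goal` value are kept in the same partition, so turns from one conversation goal never appear in both training and evaluation data. The sketch below illustrates that idea under this assumption. The function `split_by_goal`, the dict-based sample layout, and the `test_fraction` value are hypothetical; only the column name `goal` and the seed `42` come from the list above.

```python
import random
from typing import Dict, List, Tuple

Sample = Dict[str, object]


def split_by_goal(samples: List[Sample],
                  test_fraction: float = 0.2,  # hypothetical value
                  seed: int = 42,
                  key: str = "goal") -> Tuple[List[Sample], List[Sample]]:
    """Split samples so that each distinct `key` value lands in one partition."""
    # Sort before shuffling so the result depends only on the seed,
    # not on set iteration order.
    goals = sorted({str(s[key]) for s in samples})
    rng = random.Random(seed)
    rng.shuffle(goals)

    n_test = max(1, round(len(goals) * test_fraction))
    test_goals = set(goals[:n_test])

    train = [s for s in samples if str(s[key]) not in test_goals]
    test = [s for s in samples if str(s[key]) in test_goals]
    return train, test


if __name__ == "__main__":
    # Toy data: 5 goals with 3 turn-level samples each (illustrative only).
    data = [{"goal": f"goal_{g}", "turn": t} for g in range(5) for t in range(3)]
    train, test = split_by_goal(data)
    print(len(train), len(test))
```

Splitting on the group key rather than on individual turn-level samples avoids leakage between partitions when several samples derive from the same goal.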
## Dataset Size (training samples)

Prepared turn-level samples: 707